Static Analysis of String Encoders and Decoders Presented By: Loris D’Antoni Joint work with:...
-
Upload
domenic-daniel -
Category
Documents
-
view
215 -
download
0
Transcript of Static Analysis of String Encoders and Decoders Presented By: Loris D’Antoni Joint work with:...
Static Analysis of String Encoders and Decoders
Presented By:
Loris D’Antoni
Joint work with:
Margus Veanes
2
Motivation
• String Encoders and Decoders are– Ubiquitous: transformation from Unicode text files
in the Internet to in-memory representation of text– Hard to write: they use unintuitive logic in order to
enable efficiency– Hard to verify: big state space, alphabets are very
big (216 elements). Previous techniques blow up for small decoders.
3
A simple example: BASE64 encoder
3 Bytes 4 Base64 characters
• Decoder similar (every 4 encodes 3)• Uses bit manipulations to be efficient• How do we model it and prove it correct?
Text content M a n
Bytes 77 97 110
Bit Pattern 0 1 0 0 1 1 0 1 0 1 1 0 0 0 0 1 0 1 1 0 1 1 1 0
Index 19 22 5 46
Base64 Encoded T W F u
4
What Properties do we check?
Encoder, Decoder denoted by E,D• E o D = I • D o E = I• dom(E) = bytes• dom(D) = Base64 bytes
• We need– Equivalence checking– Function Composition (our model should be closed
under composition)
5
Bek code
program base64encode(input){ return iter(x in input)[q:=0;r:=0;]{ case (x>0xFF): raise InvalidCharacter; case (q==0): yield (base64(x>>2)); q:=1; r:=(x&3)<<4; case (q==1): yield (base64(r|(x>>4))); q:=2; r:=(x&0xF)<<2; case (q==2): yield (base64((r|(x>>6))), base64(x&0x3F)); q:=0;
r:=0; end case (q==1): yield (base64(r),'=','='); end case (q==2): yield (base64(r),'='); };}
How do we analyze this code?
6
Trust me! It is tricky![12/12/12 11:35:49 PM] Margus Veanes: I think it is doable, smth that is like ([A-Z2-7]{4}... )*[12/12/12 11:35:57 PM] Loris D'Antoni: ok ill try[12/12/12 11:36:22 PM] Margus Veanes: then you can ry to see the difference compared to the domain of the decoder[12/12/12 11:37:42 PM] Loris D'Antoni: it seems that also on this counterex it doesn't work[12/12/12 11:37:43 PM] Loris D'Antoni: DP2A====[12/12/12 11:37:50 PM] Loris D'Antoni: which maybe it's a bad one in this sense[12/12/12 11:37:52 PM] Loris D'Antoni: ill check now[12/12/12 11:40:45 PM] Margus Veanes: actually the domain of the decoder looks wrong, it allows 8 and 9[12/12/12 11:40:46 PM] Margus Veanes: http://www.rise4fun.com/Bek/Cy3[12/12/12 11:40:58 PM] Loris D'Antoni: yeh i fixed that in my version
…COUPLE OF HACKS LATER…
[12/13/12 12:24:02 AM] Loris D'Antoni: ok, found bug and fixed it, now proved them correct. Will work on others tomorrow. Was very silly but hard to spot[12/13/12 12:24:35 AM] Margus Veanes: ... this is why the analysis we can do is useful :-)[12/13/12 12:24:45 AM] Loris D'Antoni: yeh i was mapping[12/13/12 12:24:46 AM] Loris D'Antoni: 26..31 ==> 2..7[12/13/12 12:24:58 AM] Loris D'Antoni: instead of 26..31 ==> '2'..'7'
Brief DEMO
7
Attempt 1: Finite Transducers
M Ma
n / [TWFu]
M / []a / []
…..…..
…..
• Finite set of states• Each transition reads an input symbol and outputs a
sequence of symbols• Mapping from strings into strings• Blue state (final), for which the mapping is defined
28 edges out of every state and 216 states
Decidable equivalence and
closure under composition
8
Attempt 2: Symbolic Finite Transducers [POPL12]
M Ma
λx. x==‘M’ / [λx. x>>2]
λx. x==‘a’ / [λx. x>>4,…]
…..…..
• Guards are predicates over any decidable theory instead of single characters
• Output is a function of the input• In this case uses theory of bit-vectors• Better reflects implementation operations• Analysis is still decidable (equivalence, composition)• We did not improve much: still state explosion
Supports symbolic updates such as bit-
vectors
9
Attempt 3: Symbolic Transducers [POPL12]
1 2
True / [r|(x>>6), x&0x3F],r := 0
True / [x>>2], r := (x&3)<<4
True / [r|(x>>4)],r := (x&0xF)<<2
0
• Register can store values and is updated in transitions• Inputs and outputs can inspect and use register value• Logic is the same as for implementation!!• No state explosion!!• Closed under sequential composition• Analysis (equivalence) is undecidable in general…• We need a way to eliminate the registers
Registers
Registers
10
Register Elimination: the naïve way
1 2
x / [r|(x>>6), x&0x3F],r := 0
x / [x>>2], r := (x&3)<<4
x / [r|(x>>4)],r := (x&0xF)<<2
0
M Ma
n / [(((((M&3)<<4)|(a>>4))&0xF)<<2)|(n>>6), n&0x3F]
M / [M>>2]a / [((M&3)<<4)|(a>>4)]
…..…..
Via enumeration: State Explosion, but automatic
Can do analysis, but very slow…
Doesn’t work if alphabet infinite:
waste of Symbolic analysisWe need a Better model
ST
SFT
11
Text content M a n
Byte 77 97 110
Bit Pattern 0 1 0 0 1 1 0 1 0 1 1 0 0 0 0 1 0 1 1 0 1 1 1 0
Index 19 22 5 46
Base64 Encoded T W F u
A simple example: BASE64
3 Bytes 4 Base64 characters
• Decoder similar (every 4 encodes 3)• Uses bit manipulations to be efficient• How do we model it and prove it correct?
12
Extended Symbolic Finite Transducers[x1,x2,x3] /
[x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F]
0
• No state explosion• Analysis can be done for several interesting
cases (in particular for encoders)• But, how do we pass from STs to ESFTs?
Read sequences of symbols
Output is a function of all the 3 symbols
Register Elimination: the good way 1/2
1 2
x / [r|(x>>6), x&0x3F],r := 0
x / [x>>2], r := (x&3)<<4
x / [r|(x>>4)],r := (x&0xF)<<2
0
[x1,x2] / [r|(x1>>4), ((x1&0xF)<<2)|(x2>>6), x2&0x3F], r:=0
0
ST
ESFT
13
1x / [x>>2],
r := (x&3)<<4
2
Register Elimination: the good way 2/2
[x1,x2,x3] / [x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F]
0
Fast and supports infinite alphabets
Not always possible, but works for
encoders/decoders
14
[x1,x2] / [r|(x1>>4), ((x1&0xF)<<2)|(x2>>6), x2&0x3F], r:=0
0 1x / [x>>2],
r := (x&3)<<4
1
15
Composition of ESFTs
ESFT E
ESFT D
ST E’
ST D’
ST E’oD’ ESFT EoD
Use of registers to remember values
Uses ST closure under composition
Register elimination
Not closed in general
16
Equivalence Semi-Decision ProcedureFirst we check equivalence on domain intersection (hard)
then we check domain equivalence (easier in this case).
(λ(x1,x2).True)/[x1,x2] λ(x).True/ [x]
10
0,1
λ(x1,x2).True/([x1,x2],[x1,x2])
We build a product
transducer
17
Unicode Case Study
• We analyzed UTF8 to UTF16 encoder (E) and decoder (D)
Test Running Time
Dom(E) = UTF16 47 ms
Dom(EoD) = UTF16 109 ms
Dom(D) = UTF8 156 ms
Dom(DoE) = UTF8 320 ms
EoD=Identity (naive) 82,000 ms
DoE=Identity (naive) 134,000 ms
EoD=Identity (new algorithm) 123 ms
DoE=Identity (new algorithm) 215 ms
Complete analysis in less than a second
18
Result Summary
• ESFTs a new transducer model for representing encoders and decoders
• A new register elimination algorithm from ST to ESFTs, independent from input alphabet
• Correctness analysis of real programs: Unicode, Base64 encoders and decoders
• Automatic Javascript code generation of the verified code
• Check it out http://rise4fun.com/Bek/ • Transducers are cool!!
19
Future Work
• Understand the theory of ESFTs (coming soon)– Composition closure, equivalence…
• Extend the model to tree transformations– Widely used in NLP
• Analyze more complex scenarios– List manipulating programs