Fast and Precise Sanitizer Analysis with BEK · 2018-01-04 · ing to spend a little more time on...

Fast and Precise Sanitizer Analysis withBEK

Pieter HooimeijerUniversity of Virginia

Benjamin LivshitsMicrosoft Research

David MolnarMicrosoft Research

Prateek SaxenaUC Berkeley

Margus Veanes∗

Microsoft Research

Abstract

Web applications often use special string-manipulatingsanitizerson untrusted user data, but it is difficult to rea-son manually about the behavior of these functions, lead-ing to errors. For example, the Internet Explorer cross-site scripting filter turned out to transform some webpages without JavaScript into web pages with valid Java-Script, enabling attacks. In other cases, sanitizers mayfail to commute, rendering one order of application safeand the other dangerous.

BEK is a language and system for writing sanitiz-ers that enables precise analysis of sanitizer behavior,including checking idempotence, commutativity, andequivalence. For example, BEK can determine if a tar-get string, such as an entry on the XSS Cheat Sheet, isa valid output of a sanitizer. If so, our analysis synthe-sizes an input string that yields that target. Our languageis expressive enough to capture real web sanitizers usedin ASP.NET, the Internet Explorer XSS Filter, and theGoogle AutoEscape framework, which we demonstrateby porting these sanitizers to BEK.

Our analyses use a novelsymbolic finite automatarepresentation to leverage fast satisfiability modulo the-ories (SMT) solvers and are quick in practice, tak-ing fewer than two seconds to check the commutativ-ity of the entire set of Internet Exporer XSS filters,between 36 and 39 seconds to check implementationsof HTMLEncode against target strings from the XSSCheat Sheet, and less than ten seconds to check equiv-alence between all pairs of a set of implementations ofHTMLEncode. Programs written in BEK can be compiledto traditional languages such as JavaScript and C#, mak-ing it possible for web developers to write sanitizers sup-ported by deep analysis, yet deploy the analyzed codedirectly to real applications.

1 Introduction

Cross site scripting (“XSS”) attacks are a plague in to-day’s web applications. These attacks happen becausethe applications take data from untrusted users, and thenecho this data to other users of the application. Becauseweb pages mix markup and JavaScript, this data maybe interpreted as code by a browser, leading to arbitrarycode execution with the privileges of the victim. The first

∗Authors are listed alphabetically. Work done while P. Hooimeijerand P. Saxena were visiting Microsoft Research.

line of defense against XSS is the practice ofsanitiza-tion, where untrusted data is passed through asanitizer,a function that escapes or removes potentially danger-ous strings. Multiple widely used Web frameworks offersanitizer functions in libraries, and developers often addadditional custom sanitizers due to performanceor functionality constraints.

Unfortunately, implementing sanitizerscorrectly issurprisingly difficult. Anecdotally, in dozens of code re-views performed across various industries, just about anycustom-written sanitizer was flawed with respect to secu-rity [38]. The recent SANER work, for example, showedflaws in custom-written sanitizers used by ten web ap-plications [9]. For another example, several groups ofresearchers have found specially crafted pages that donot initially have cross site scripting attacks, but whenpassed through anti-cross-site scripting filters yield webpages that cause JavaScript execution [10, 22].

The problem becomes even more complicated whenconsidering that a web application maycomposemulti-ple sanitizers in the course of creating a web page. Ina recent empirical analysis, we found that a large webapplication often applied the same sanitizers twice, de-spite these sanitizers not being idempotent. This analysisalso found that the order of applying different sanitizerscould vary, which is safe only if the sanitizers are com-mutative [32], providing further evidence suggesting thatdevelopers have a difficult time writing correct sanitiza-tion functions without assistance.

Despite this, much work in the space of detecting andpreventing XSS attacks [19, 23, 25, 27, 39] has optimisti-cally assumed that sanitizers are in fact both known andcorrect. Some recent work has started exploring the is-sue of specification completeness [24] as well as san-itizer correctness by explicitly statically modeling setsof values that strings can take at runtime [13, 26, 36, 37].These approaches use analysis-specific models of stringsthat are based on finite automata or context-free gram-mars. More recently, there has been significant interestin constraint solving tools that model strings [11, 17, 18,20, 31, 34, 35]. String constraint solvers allow any clientanalysis to express constraints (e.g., path predicates forasingle code path) that include commonstring manipulation functions.

Sanitizers are typically a small amount of code, per-haps tens of lines. Furthermore, application developersknow when they are writing a new, custom sanitizer or setof sanitizers. Our key proposition is that if we are will-

ing to spend a little more time on this sanitizer code, wecan obtain fast and precise analyses of sanitizer behavior,along with actual sanitizer code ready to be integratedinto both server- and client-side applications. Our ap-proach is BEK, a language for modeling string transfor-mations. The language is designed to be (a) sufficientlyexpressive to model real-world code, and (b) sufficientlyrestricted to allow fast, precise analysis, without needingto approximate the behavior of the code.

Key to our analysis is a compilation from BEK pro-grams tosymbolic finite state transducers, an extensionof standard finite transducers. Recall that a finite trans-ducer is a generalization of deterministic finite automatathat allows transitions from one state to another to be an-notated withoutputs: if the input character matches thetransition, the automaton outputs a specified sequence ofcharacters. In a symbolic finite transducer, transitionsare annotated with logicalformulas instead of specificcharacters, and the transducer takes the transition on anyinput character that satisfies the formula. We apply algo-rithms that determine if two BEK programs are equiva-lent. We also can check if a BEK program can output aspecific string, and if so, synthesize an inputyielding that string.

Our symbolic finite state transducer representationenables leveragingsatisfiability modulo theories (SMT)solvers, tools that take a formula and attempt to find in-puts satisfying the formula. These solvers have becomerobust in the last several years and are used to solve com-plicated formulas in a variety of contexts. At the sametime, our representation allows leveraging automata the-oretic methods to reason about strings of unboundedlength, which is not possible via direct encoding to SMTformulas. SMT solvers allow working with formulasfrom any theory supported by the solver, while otherprevious approaches using binary decision diagrams arespecialized to specific types of inputs.

After analysis, programs written in BEK can be com-piled back to traditional languages such as JavaScript orC# . This ensures that the code analyzed and tested isfunctionally equivalent to the code which is actually de-ployed for sanitization, up to bugs in our compilation.

This paper contains a number of experimental casestudies. We conclusively demonstrate that BEK is ex-pressive enough for a wide variety of real-life code byconverting multiple real world Web sanitization func-tions from widely used frameworks, including those usedin Internet Explorer 8’s cross-site scripting filter, to BEK

programs. We report on which features of the BEK lan-guage are needed and which features could be addedgiven our experience. We also examine other code,such as sanitizers from Google AutoEscape and func-tions from WebKit, to determine whether or not they canbe expressed as BEK programs. We maintain samples ofBEK programs online1.

1http://code.google.com/p/bek/

We then use BEK to perform security specific analy-ses of these sanitizers. For example, we use BEK to de-termine whether there exists an input to a sanitizer thatyields any member of a publicly available database ofstrings known to result in cross site scripting attacks. Ouranalysis is fast in practice; for example, we take two sec-onds to check the commutativity of the entire set of In-ternet Explorer 8 XSS filters, and less than 39 seconds tocheck an implementations theHTMLEncode sanitizationfunction against target strings from theXSS Cheat Sheet [5].

To experimentally demonstrate the difficulty of writ-ing correct sanitizers, we hired several freelance devel-opers to implementHTMLEncode functionality. UsingBEK, we checked theequivalenceof the seven differ-ent implementations ofHTMLEncode and used BEK tofind counterexamples: inputs on which these sanitizersbehave differently. Finally, we performed scalability ex-periments to show that in practice the time to performBEK analyses scales near-linearly.

1.1 Contributions

The primary contributions of this paper are:

• Language. We propose a domain-specific lan-guage, BEK, for string manipulation. We describe asyntax-driven translation from BEK expressions tosymbolic finite state transducers.

• Algorithms. We provide algorithms for performingcomposition computation and equivalence check-ing, which enables checking commutativity, idem-potence, and determining if target strings can beoutput by a sanitizer. We show how JavaScript andC# code can be generated out of BEK programs,streamlining the client- and server-side deploymentof BEK sanitizers.

• Evaluation. We show that BEK can encode real-world string manipulating code used to sanitize un-trusted inputs in web applications. We demonstratethe expressiveness of BEK by encoding OWASPsanitizers, many IE 8 XSS filters, as well as func-tions written by freelance developers hired throughodesk.com andvworker.com for our experimentspresented in this paper. We show how the analy-ses supported by our tool can find security-criticalbugs or check that such bugs do not exist. Toimprove the end-user experience when a bug isfound, BEK produces a counter-example. We dis-cover that only 28.6% of our sanitizers commute,∼79.1% are idempotent, and that only 8% are re-versible. We also demonstrate that most hand-written HTMLEncode implementations disagree onat least some inputs.

• A Scalable Implementation.BEK deals with Uni-code strings without creating a state explosion. Fur-thermore, we show that our algorithms for equiv-alence checking and composition computation are

Figure 1: BEK architecture. We use a representationbased onsymbolic finite state transducers(defined in-text) to model string sanitization code without approxi-mation.

very fast in practice, scaling near-linearly with thesize of the symbolic finite transducer representation.The main reason for this is the symbolic representa-tion of the transition relation.

While the focus of this paper is on XSS attacks2, ourlanguage and analyses are more general and apply toany string manipulating function. For example Chenetal. check interactions between firewall rules, finding re-dundant and order-dependent rules in routers [40]. Choand Babic [12] check the equivalence between a specifi-cation and an implementation forstate machines in SMTP servers.

2 Overview

Figure 1 shows an architectural diagram for the BEK sys-tem. At the center of the picture is the transducer-basedrepresentation of a BEK program. At the moment, wesupport a BEK language front end, although other frontends that convert Java or C# programs into BEK are alsopossible. We provide motivating examples of the BEK

language in Section 2.1 and discuss the applications ofBEK to analyzing sanitizers in Section 2.2.

2.1 Introductory Examples

Example 1. The following BEK program is a basic san-itizer that backslash-escapes single and double quotes(but only if they are not escaped already). Theiter con-struct is a block that uses a character variablec and asingle boolean state variableb that is initially f (false).Each iteration of the block binds the character variable toa single character of the stringt; iteration continues un-til no more characters remain. The block is broken into

2The dual of the issue of code injection is data privacy; BEK isequally suitable to analyzing the corresponding data cleansing func-tions.

private static string EncodeHtml(string t)

{

if (t == null) { return null; }

if (t.Length == 0) { return string.Empty; }

StringBuilder builder =

new StringBuilder("", t.Length * 2);

foreach (char c in t)

{

if ((((c > ’‘’) && (c < ’{’)) ||

((c > ’@’) && (c < ’[’))) || (((c == ’ ’) ||

((c > ’/’) && (c < ’:’))) || (((c == ’.’) ||

(c == ’,’)) || ((c == ’-’) || (c == ’_’))))){

builder.Append(c);

} else {

builder.Append("&#" +

((int) c).ToString() + ";");

}

}

return builder.ToString();

}

Figure 2: Code forAntiXSS.EncodeHtml version 2.0.

case statements. If a character satisfies the condition ofthe case statement, the corresponding code is executed.Hereyield(c) outputs the current characterc.

iter(c in t) {b := f ; } {

case(¬(b) ∧ (c = ‘’’ ∨ c = ‘"’)) {

b := f ; yield(‘\’); yield(c); }

case(c = ‘\’) {

b := ¬(b); yield(c); }

case(t) {

b := f ; yield(c); }

}

The boolean variableb is used to track whether the previ-ous character seen was an unescaped slash. For example,in the input\\" the double quote is not considered es-caped, and the transformed output is\\\". If we apply theBEK program to\\\" again, the output is the same. Aninteresting question is whether this holds for any outputstring. In other words, we may be interested in whethera given BEK program isidempotent.

If implemented incorrectly, double applications ofsuch sanitization functions can result in duplicate escap-ing. This in turn has led to command injection of script-injection attacks in the past. Therefore, checkingidem-potenceof certain functions is practically useful. We willsee in the next section how BEK canperform such checks. �

Example 2. The code in Figure 2 is from the publicMicrosoft AntiXSS library. The sanitizer iterates overthe input character-by-character. Depending on the char-acter encountered, a different action is taken, such as in-cluding the character verbatim or encoding it in somemanner, such as numeric HTML escaping.

The BEK program corresponding toEncodeHtml is

iter (c in t){case (¬ϕ(c)){yield [‘&’,‘#’] + dec(c) + [‘;’]; }

case(true){yield [c]; }}

wheredec is a built-in library function that returns thedecimal representation of the character andϕ(c) is theformula

(‘a’ ≤ c ∧ c ≤ ‘z’) ∨ (‘A’ ≤ c ∧ c ≤ ‘Z’) ∨(‘0’ ≤ c ∧ c ≤ ‘9’) ∨ c = ‘ ’ ∨ c = ‘.’ ∨c = ‘,’ ∨ c = ‘−’ ∨ c = ‘ ’

The BEK program iterates over each character of theinput. If the character satisfies the formulaϕ(c), then theprogram outputs the character. Otherwise the programescapes the character by outputting its decimal encod-ing, together with the&# prefix and semicolon. Notethat this sanitizer is not idempotent, because applying thefunction twice to the string&# will result in double es-caping. Our tool can detect this in under a second.�

Multiple implementations may exist of the “same”sanitizer. For example, Figure 3 shows the result of run-ning the Red Gate Reflector .NET decompiler on the Sys-tem.NET implementation ofEncodeHTML. We have con-verted this code to BEK as well, noticing that thegotostructure is the result of a loop after decompilation. Us-ing our analyses, we can check these implementations forequivalence. Our implementation can detect in less thanone second that the System.NET implementation doesnot escape single quote characters, while the AntiXSSimplementation does, meaning that the two implementa-tions are not equivalent. Failure to escape single quotescan lead to XSS attacks, so thisdifference is significant [33].

2.2 Security Applications

Web sanitizers are the first line of defense against cross-site scripting attacks for web applications: they are func-tions applied to untrusted data provided by a user thatattempt to make the data “safe” for rendering in a webbrowser. Reasoning about the security properties of websanitizers is crucial to the security of web applicationsand browsers. Formal verification of sanitizers is there-fore crucial in proving the absence of injection attackssuch as cross-site and cross-channel scripting as well asinformation leaks.

2.2.1 Security of Sanitizer Composition

Recent work has demonstrated that developers mayaccidentally compose sanitizers in ways that are notsafe [32]. BEK can check two key properties of sanitizercomposition: commutativity and idempotence.

public static string EncodeHtml(string s)

{

if (s == null)

return null;

int num = IndexOfHtmlEncodingChars(s, 0);

if (num == -1)

return s;

StringBuilder builder=new StringBuilder(s.Length+5);

int length = s.Length;

int startIndex = 0;

Label_002A:

if (num > startIndex)

{

builder.Append(s, startIndex, num-startIndex);

}

char ch = s[num];

if (ch > ’>’)

{

builder.Append("&#");

builder.Append(((int) ch).

ToString(NumberFormatInfo.InvariantInfo));

builder.Append(’;’);

}

else

{

char ch2 = ch;

if (ch2 != ’"’)

{

switch (ch2)

{

case ’<’:

builder.Append("<");

goto Label_00D5;

case ’=’:

goto Label_00D5;

case ’>’:

builder.Append(">");

goto Label_00D5;

case ’&’:

builder.Append("&");

goto Label_00D5;

}

}

else

{

builder.Append(""");

}

}

Label_00D5:

startIndex = num + 1;

if (startIndex < length)

{

num = IndexOfHtmlEncodingChars(s, startIndex);

if (num != -1)

{

goto Label_002A;

}

builder.Append(s, startIndex, length-startIndex);

}

return builder.ToString();

}

Figure 3: Code forEncodeHtml from version 2.0 ofSystem.Net. This code is not equivalent to the AntiXSSlibrary version.

Commutativity: Consider two default sanitizers inthe Google CTemplate framework:JavaScriptEscapeand HTMLEscape [4]. The former performs Uni-code encoding (\u00XX) for safely embedding untrusteddata in JavaScript strings while the latter sanitizer per-forms HTML entity-encoding (<) for embedded un-trusted data in HTML content. It turns out that ifJavaScriptEscape is applied to untrusted data beforethe application ofHTMLEscape, certain XSS attacks arenot prevented [32]. The opposite ordering does preventthese attacks. BEK can check if a pair of sanitizers arecommutative, which would mean the programmer doesnot need to worry about this class of bugs.

Idempotence: BEK can check if applying the sanitizertwice yields different behavior from a single application.For example, an extra JavaScript string encoding maybreak the intended rendering behavior in the browser.

2.2.2 Sanitizer Implementation Correctness

Hand-coded sanitizers are notoriously difficult to writecorrectly. Analyses provided by BEK help achieve cor-rectness in three ways.

Comparing multiple sanitizer implementations: Mul-tiple implementations of the same sanitization function-ality can differ in subtle ways [9]. BEK can checkwhether two different programs written in the BEK lan-guage are equivalent. If they are not, BEK exhibits inputsthat yield different behaviors.

Comparing sanitizers to browser filters: Internet Ex-plorer 8 and 9, Google Chrome, Safari, and Firefox em-ploy built-in XSS filters (or have extensions [3]) that ob-serve HTTP requests and responses [1, 2] for attacks.These filters are most commonly specified as regularexpressions, which we can model with BEK. We canthen check for inputs that are disallowed by browser fil-ters, but which are allowed by sanitizers. For example,BEK can determine that the AntiXSS implementation ofthe EncodeHTML sanitizer in Figure 2 does not blockstrings such asjavascript: which are prevented byIE 8 XSS filters. These differences indicate potentialbugs in the sanitizer or the filter.

Checking against public attack sets: Several pub-lic XSS attack sets are available, such as XSS cheatsheet [5]. With BEK, for all sanitizers, for all attack vec-tors in an attack set, we can check if there exists an inputto the sanitizer that yields the attack vector.

3 The BEK Language and Transducers

In this section, we give a high-level description of asmall imperative language, BEK, of low-level string op-erations. Our goal is two-fold. First, it should be possibleto model BEK expressions in a way that allows for theiranalysis using existing constraint solvers. Second, wewant BEK to be sufficiently expressive to closely modelreal-world code (such as Example 2). In this section

Bool ConstantsB ∈ {t, f}Char Constants d ∈ Σ

Bool Variables b, . . .Char Variables cString Variables t

Strings sexpr ::= iter(c in sexpr) {init} {case∗}| fromLast(ccond, sexpr)| uptoLast(ccond, sexpr) | t

init ::= (b := B)∗

case ::= case(bexpr) {cstmt}| endcaseendcase ::= end(ebexpr){yield(d)∗}cstmt ::= (b := ebexpr; | yield(cexpr);)∗

Booleans bexpr ::= Boolcomb(bexpr) |B | b | ccondebexpr ::= Boolcomb(ebexpr) |B | bccond ::= Boolcomb(ccond) |cexpr = cexpr

| cexpr < cexpr | cexpr > cexprChar strings cexpr ::= c | d | built-in-fnc(c) | cexpr + cexpr

Figure 4: Concrete syntax for BEK. Well-formed BEK

expressions are functions of typestring → string;the language provides basic constructs to filter and trans-form the single input stringt. Boolcomb(e) stands forBoolean combination ofe using conjunction, disjunc-tion, and negation.

we first present the BEK language. We then define thesemantics of BEK programs in terms ofsymbolic finitetransducers(SFTs), an extension of classicalfinite statetransducers. Finally, we describe several core decisionprocedures for SFTs that provide an algorithmic founda-tion for efficient static analysisand verification of BEK programs.

3.1 TheBEK Language

Figure 4 describes the language syntax. We define a sin-gle string variable,t, to represent an input string, anda number of expressions that can take eithert or an-other expression as their input. TheuptoLast(ϕ, t) andfromLast(ϕ, t) are built-in search operations that ex-tract the prefix (suffix) oft upto (from) and excludingthe last occurrence of a character satisfyingϕ. Theseconstructs are listed separately because they cannot beimplemented using other language features. Finally, theiter construct allows for character-by-character iterationover a string expression.

Example 3. uptoLast(c = ‘.’,"w.abc.org")= "www.abc", fromLast(c = ‘.’,"w.abc.org")="org". �

Theiter construct is designed to model loops that tra-verse strings while making imperative updates to booleanvariables. Given a string expression (sexpr), a char-acter variablec, and an initial boolean state (init), thestatement iterates over characters insexpr and evaluatesthe conditions of the case statements in order. When acondition evaluates to true, the statements incstmt mayyield zero or more characters to the output and update theboolean variables for future iterations. Theendcase ap-plies when the end of the input string has been reached.When no case applies, this correspond to yielding zero

characters and the iteration continues or the loop termi-nates if the end of the input has been reached.

3.2 Finite Transducers

We start with the classical definition offinite state trans-ducers. The particular sublass of finite transducers thatwe are considering here are also calledgeneralized se-quential machinesor GSMs [29], however, this defini-tion is not standardized in the literature, and we there-fore continue to say finite transducers for this restrictedcase. The restriction is that, GSMs read one symbol ateach transition, while a more general definition allowstransitions that skip inputs.

Definition 1. A Finite TransducerA is defined as a six-tuple(Q, q0, F,Σ,Γ,∆), whereQ is a finite set ofstates,q0 ∈ Q is theinitial state,F ⊆ Q is the set offinal states,Σ is theinput alphabet, Γ is theoutput alphabet, and∆is thetransition functionfromQ× Σ to 2Q×Γ∗

.

We indicate a component of a finite transducerA byusingA as a subscript. For(q, v) ∈ ∆A(p, a) we define

the notationpa/v−→A q, wherep, q ∈ QA, a ∈ ΣA and

v ∈ Γ∗A. We writep

a/v−→ q whenA is clear from the

context. Given wordsv andw we let v · w denote theconcatenation ofv andw. Note thatv · ε = ε · v = v.

Givenqiai/vi−→A qi+1 for i < n we writeq0

u/v−→A qn

whereu = a0 ·a1 ·. . .·an−1 andv = v0 ·v1 ·. . .·vn−1. We

write alsoqε/ε−→A q. A induces thefinite transduction,

TA : Σ∗A → 2Γ

∗

A :

TA(u)def= {v | ∃q ∈ FA (q0A

u/v−→ q)}

We lift the definition to sets,TA(U)def=

⋃u∈U T (u).

Given two finite transductionsT1 andT2, T1 ◦ T2 de-notes the finite transduction that maps an input wordu tothe setT2(T1(u)). In the following letA andB be finitetransducers. A fundamental composition ofA andB isthe join composition ofA andB.

Definition 2. Thejoin ofA andB is the finite transducer

A◦Bdef= (QA×QB, (q

0A, q

0B), FA×FB,ΣA,ΓB,∆A◦B)

where, for all(p, q) ∈ QA ×QB anda ∈ ΣA:

∆A◦B((p, q), a)def= {((p′, q), ε) | p

a/ε−→A p

′}

∪ {((p′, q′), v) | (∃u ∈ Γ+A)

pa/u−→A p

′, qu/v−→B q′}

The following property is well-known and allows usto drop the distinction betweenA andTAwithout causing ambiguity.

Proposition 1. TA◦B = TA ◦ TB.

The following classification of finite transducers plays acentral role in the sections discussing translation fromBEK and decision procedures forsymbolic finite transducers.

Definition 3. A is single-valuedif for all u ∈ Σ∗A,

|A(u)| ≤ 1.

3.3 Symbolic Finite Transducers

Symbolic finite transducers, as defined below, provide asymbolic representation of finite transducers using termsmodulo a given background theoryT . The backgrounduniverseV of values is assumed to bemulti-sorted, whereeach sortσ corresponds to a sub-universeVσ. Theboolean sort isBOOL and contains the truth valuest(true) andf (false). Definition of terms and formulas(boolean terms) is standard inductive definition, usingthe function symbols and predicate symbols ofT , log-ical connectives, as well asuninterpreted constantswithgiven sorts. All terms are assumed to be well-sorted. Atermt of sortσ is indicated byt : σ. Given a termt and asubstitutionθ from variables (or uninterpreted constants)to terms or values,Subst(t, θ) denotes the term resultingfrom applying the substitutionθ to t.

A model is a mapping of uninterpreted constants tovalues.3 A model for a termt is a model that providesan interpretation for all uninterpreted constants that oc-cur in t. (All free variables are treated as uninterpretedconstants.) Theinterpretationor valueof a termt in amodelM for t is given by standard Tarski semantics us-ing induction over the structure of terms, and is denotedby tM . A formula (predicate)ϕ is true in a modelMfor ϕ, denoted byM |= ϕ, if ϕM evaluates to true. Aformulaϕ is satisfiable, denoted byIsSat(ϕ), if thereexists a modelM such thatM |= ϕ. Any termt:σ thatincludes no uninterpreted constants is called avalue termand denotes a concrete value[[t]] ∈ Vσ.

Let TermγT (x) denote the set of all terms inT of sort

γ, wherex = x0, . . . , xn−1 may occur as the only un-interpreted constants (variables). LetPredT (x) denoteTermBOOL

T (x). In order to avoid ambiguities in notation,given a setE of elements, we write[e0, . . . , en−1] forelements ofE∗, i.e., sequences of elements fromE. Weuse both[] andε to denote the empty sequence. As above,if e1, e2 ∈ E∗, thene1 · e2 ∈ E∗ denotes the con-catenation ofe1 with e2. We lift the interpretation ofterms to apply to sequences: foru = [u0, . . . , un−1] ∈

TermγT (x)

∗ letuMdef= [uM0 , . . . , u

Mn−1] ∈ (Vγ)∗.

In the following letc:σ be afixeduninterpreted con-stant of sortσ. We refer toc:σ as theinput variable(forthe given sortσ).

Definition 4. A Symbolic Finite Transducer (SFT) forTis a six-tuple(Q, q0, F, σ, γ, δ), whereQ is a finite set ofstates, q0 ∈ Q is the initial state, F ⊆ Q is the set of

3The interpretations of background functions ofT is fixed and isassumed to be an implicit part of all models.

// GFED@ABC?>=<89:;q0

(c 6=′.′)/[c]

�� (c=′.′)/[],,

(c=′.′)/[c] ++

GFED@ABC?>=<89:;q1

(c 6=′.′)/[]

��

GFED@ABCq2

(t)/[c]

UU

(c=′.′)/[]

==

Figure 5: Symbolic finite state transducer foruptoLast(c=‘.’, input). This transducer is non-deterministic; there are two transitions that match‘.’from stateq0.

final states, σ is the input sort, γ is theoutput sort, andδ is thesymbolic transition functionfromQ×PredT (c)to 2Q×Term

γT(c)∗ .

We use the notationpϕ/u−→A q for (q,u) ∈ δA(p, ϕ)

and callpϕ/u−→A q a symbolic transition, ϕ/u is called

its label, ϕ is called itsinput (guard)andu its output.An SFT A = (Q, q0, F, σ, γ, δ) denotes the finite

transducer[[A]] = (Q, q0, F,Vσ,Vγ ,∆) wherepa/v−→[[A]]

q if and only if there existspϕ/u−→A q and a modelM

such thatM |= ϕ, cM = a, uM = v.For an STFA let the underlyingtransductionTA be

T[[A]]. For a stateq ∈ QA let T qA(v) (T q[[A]](v)) denotethe set of outputs when starting fromq with input v. Inparticular, ifq = q0A thenTC = T qA andT[[A]] = T q[[A]].The following proposition follows directly from the def-inition of [[A]].

Proposition 2. For v ∈ Σ∗[[A]] and q ∈ QA: T qA(v) =

T q[[A]](v).

Example 4. The identitySFT Id (for sortσ) is defined

follows. Id = ({q}, q, {q}, σ, σ, {qt/[c]−→ q}). Thus, for

all a ∈ Vσ, qa/a−→[[Id]] q, and [[Id ]](v) = {v} for all

v ∈ (Vσ)∗. �

Example 5. Assumeσ is the sort for characters. Thepredicatec = ‘.’ says that the input character is a dot.The SFTUptoLastDot such that for all stringsv,

UptoLastDot(v) = uptoLast(c = ‘.’, v),

whereuptoLast is the BEK function introduced above,is shown in Figure 5. �

Composition works directly with SFTs, and keeps theresulting SFTcleanin the sense that all symbolic transi-tions arefeasible, and eliminates states that areunreach-able from the initial stateas well as non-initial statesthat arenot backwards reachable from any final state. Inorder to preserve feasibility of transitions the algorithmuses a solver for checking satisfiability of formulas inPredT (c).

3.4 BEK to SFT translation

The basic sort needed in this section, besidesBOOL, isa sortCHAR for characters. We also assume the back-ground relation< : CHAR × CHAR → BOOL as a stricttotal order corresponding to the standard lexicographicorder over ASCII (or Unicode) characters and assume>,≤ and≥ to be defined accordingly. We also assume thateach individual character has a built-in constant such as‘a’:CHAR. For example,

(‘A’ ≤ c ∧ c ≤ ‘Z’) ∨ (‘a’ ≤ c ∧ c ≤ ‘z’)∨(‘0’ ≤ c ∧ c ≤ ‘9’) ∨ c = ‘ ’

descibes the regex character class\w of all word char-acters in ASCII. (Direct use of regex character classesin BEK, such ascase(\w) {. . .}, is supported in the en-hanced syntax supported in the BEK analyzer tool.)

Each sexpr e is translated into an SFTSFT (e).For the string variablet, SFT (e) = Id , with Id

as in Example 4. The translation ofuptoLast(ϕ, e)is the symbolic compositionSTF (e) ◦ B where Bis an SFT similar to the one in Example 5, exceptthat the conditionc = ‘.’ is replaced byϕ. Thetranslation offromLast(ϕ, e) is analogous. Finally,SFT (iter(c in e) {init} {case∗}) = SFT (e) ◦ BwhereB = (Q, q0, Q, CHAR, CHAR, δ) isconstructed as follows:

Step 1: Normalize. Transformcase∗ so that case con-ditions are mutually exclusive by adding the nega-tions of previous case conditions as conjuncts to allthe subsequent case conditions, and ensure that eachboolean variable has exactly one assignment in eachcstmt (add the trivial assignmentb := bif b is not assigned).

Step 2: Compute states.Compute the set of statesQ.Let q0 be an initial state as the truth assignment toboolean variables declared ininit.4 Compute thesetQ of all reachable states, by using DFS, suchthat, given a reached stateq, if there exists a casecase(ϕ) {cstmt} such thatSubst(ϕ, q) is satisfi-ablethen add the state

{b 7→ [[Subst(ψ, q)]] | b := ψ ∈ cstmt} (1)

toQ. (Note thatSubst(ψ, q) is a value term.)

Step 3: Compute transitions. Compute the symbolictransition functionδ. For each stateq ∈ Q andfor each casecase(ϕ) {cstmt} such thatφ =Subst(ϕ, q) is satisfiable. Letp be the state com-puted in (1). Letyield(u0), . . . ,yield(un−1) bethe sequence of yields incstmt and let u =[u0, . . . , un−1]. Add the symbolic

transitionqφ/u−→ p to δ.

4Note thatq0 is the empty assignment ifinit is empty, which trivi-alizes this step.

// GFED@ABC?>=<89:;q0

(c/∈{′′′,′”′,′\′})/[c]

��

(c∈{′′′,′”′})/[′\′, c]

UU

(c=′\′)/[c]++ GFED@ABC?>=<89:;q1

(t)/[c]

kk

Figure 6: SFT for BEK program in Example 1. ThisSFT escapes single and double quotes with a backslash,except if the current symbol is already escaped. The ap-plication of this SFT is idempotent.

The translation of end-cases is similar, resulting in sym-bolic transitions with guardc = ⊥, where⊥ is a spe-cial character used to indicate end-of-string. We assume⊥ to be least with respect to<. For example, assum-ing that the BEK programs use concrete ASCII charac-ters,⊥:CHAR is either anadditionalcharacter, or the nullcharacter‘\0’ if only null-terminated strings are consid-ered as valid input strings. Although practically impor-tant, end-cases do not cause algorithmic complications,and for the sake of clarity we avoid themin further discussion.

The algorithm uses a solver to check satisfiability ofguard formulas. If checking satisfiability of a formula forexample times out, then it is safe to assume satisfiabil-ity and to include the corresponding symbolic transition.This will potentially add infeasible guards but retains thecorrectnessof the resulting SFT, meaning that the under-lying finite transduction is unchanged. While in mostcases checking satisfiability of guards seems straight-forward, but when considering Unicode, this perceptionis deceptive. As an example, the regex character class[\W-[\D]] denotes an empty set since\d is a subset of\w and \W (\D) is the complement of\w (\d), and thus,[\W-[\D]] is the intersection of\W and\d. Just the charac-ter class\w alone contains 323 non-overlapping ranges inUnicode, totaling 47,057 characters. A naıve algorithmfor checking satisfiability (non-emptiness) of[\W-[\D]]

may easily time out.Consider the BEK program in Example 1. The cor-

responding SFT constructed by the above translation isshown in Figure 6. There are two symbolic transitionsfrom stateq0 to itself. The first corresponds to the caseswhere the input characterc needs to be escaped, and thesecond to cases where the input does notneed to be escaped.

3.5 Join Composition and Equivalence

We now give an informal description of our core algo-rithms for reasoning about SFTs:join compositionandequivalence. We then show how these algorithms can beused to check properties such as idempotence, existenceof an input yielding a target string, and commutativity.

Thejoin compositionA ◦B corresponds to a programtransformation that constructs a single loop over the in-put string out of two consecutive loops in SFTsA andB.

The join composition algorithm constructs an SFTA◦Bsuch thatT[[A◦B]] = T[[A]]◦T[[B]]. The intuition behind theconstruction is that the outputs produced byA are sub-stitutedsymbolicallyin as the inputs consumed by theB. The composition algorithm proceeds by depth-firstsearch, first computingQA◦B as constructed as a reach-able subset ofQA × QB, starting from(q0A, q

0B). Here

we use the SMT solver to determine reachability, callingthe solver as a black box to determine if a path from onestate to another is feasible or not. This makes our con-structionindependentof the particular background the-ory. In general, this is not true for other recent exten-sions of finite transducers such as streaming transduc-ers [6], where compositionality depends on properties ofthe background theory that is being used.

Two SFTsA andB areequivalentif TA = TB. Let

Dom(A)def= {v | TA(v) 6= ∅}.

Checking equivalence ofA andB reduces to two sepa-rate tasks:

1. Deciding domain-equivalence: Dom(A) =Dom(B).

2. Deciding partial-equivalence: for all v ∈Dom(A) ∩ Dom(B), TA(v) = TB(v).

Note that 1 and 2 are independent and do not implyeach other, but together they imply equivalence. Do-main equivalence holds for all SFTs constructed by BEK,because all programs share the same domain, namelythat of strings. Checking partial equivalence is more in-volved. We leverage the fact that all SFTs we constructare single-valued. Our equivalence algorithm first com-putes the join composition ofA andB, then uses theSMT solver to search for inputs that causeA to differfrom B. We have anonconstructiveproof of termina-tion for this algorithm: it establishes that ifA andBare equivalent, then the search must terminate in timequadratic in the number of states of the composed au-tomata. In practice, the SMT solver carries out thissearch, and our results in Section 4 show scaling is closerto linear in practice.

Equivalence and join composition allow us to carry outa variety of other analyses. Idempotence of an SFTAcan be first checked by computingB = A ◦ A, thenchecking the equivalence ofA andB. If the two SFTs arenot equivalent, thenA fails to be idempotent. Similarly,commutativity of two SFTsA andB can be determinedby computingC = A◦B andD = B ◦A, then checkingequivalence. The idea is illustrated in Figure 7. We canalso compute theinverse imageof a SFT with respect to astrings, which lets us find out the set of inputs to the SFTthat yields as an output. We use all of these analyses tocheck sanitizers for securityproperties in the next section.

Our approach has an advantage over traditional finitetransducers (FTs), due to succinctness of SFTs. Suppose

� ^]v�µ��]vP_ z

A not

idempotent

A x A

A A

A

� ^]v�µ��]vP_ z A and B not

commutative

B x A

B A

A x B

A B

Figure 7: Using composition and equivalence of SFTsto decide idempotence and commutativity.

for example that the background character theoryT is k-bit bit vector arithmetic wherek depends on the desiredcharacter range (e.g., for Unicode,k = 16). An explicitexpansion of a BEK SFTA to [[A]] may increase the size(nr of transitions) by a factor of2k. Partial-equivalenceof single-valued FTs is solvableO(n2) [15] time. Thus,for an SFTA of sizen, using the partial-equivalence al-gorithm for [[A]] takesO((2kn)2) time. In contrast, thepartial-equivalence algorithm for BEK SFTs isO(n2).When the background theory is linear arithmetic, thenthe alphabet is infinite and a correspoding FT algorithmis therefore not even possible.

4 Evaluation

In the following subsections, we evaluate the real-worldapplicability of BEK in terms of expressivess,utility, and performance:

• Section 4.1 evaluates whether BEK can model ex-isting real-world code. We conduct an empericalstudy of a large body of code to see how widely-used BEK-modelable sanitizer functions are (Sec-tion 4.1.1), and we evaluate which BEK featuresare needed to model sanitizers from AutoEscape,OWASP, and Internet Explorer 8 (Section 4.1.2).

• We put BEK to work to check existing sanitizers foridempotence, commutativity, and reversibility (Sec-tion 4.2).

• We perform pair-wise equivalence checks on a num-ber of portedHTMLEncode implementations, as wellas two outsourced implementations (Section 4.3).

• We evaluate effectiveness of existingHTMLEncodeimplementations against known attack strings takenfrom the Cross-site Scripting Cheat Sheet (Sec-tion 4.4).

• We use a synthetic benchmark to evaluate the scal-ability of performing equivalence checks on BEK

programs (Section 4.5).

• We provide a short example to highlight the factthat BEK programs can be readily translated to otherprogramming languages (Section 4.6).

These experiments are based on an implementation thatconsists of roughly5, 000 lines of C# code that imple-ments the basic transducer algorithms and Z3 [14] inte-gration, with another1, 000 lines of F# code for transla-tion from BEK to transducers. Our experiments were car-ried out on a Lenovo ThinkPad W500 laptop with 8 GBof RAM and an Intel Core 2 Duo P9600 processor run-ning at2.67 GHz, running 64-bit Windows 7.

4.1 Expressive Utility

Thus far, we discussed the expressiveness of BEK pri-marily in theoretical terms. In this subsection, we turnour attention to real-world applicability instead, througha case study that aims to demonstrate that a wide varietyof commonly used sanitizers can be ported toBEK with relative ease.

4.1.1 Frequency of Sanitizer use in PHP code.

PHP is a widely-used open source server-side scriptinglanguage. Minamide’s seminal work on the static anal-ysis of dynamic web applications [26] includes finite-transducer based models for a subset of PHP’s sanitizerfunctions. These transducers are hand-crafted in severalthousand lines of OCaml. We conducted an informal re-view of the PHP source to confirm that each transducercould be modeled as a BEK program.

Our goal is to perform a high-level quantitative com-parison of the applicability of BEK, on the one hand,and existing string constraint solvers (e.g., DPRLE [17],Hampi [20], Kaluza [30], and Rex [35]) on the other. Forthis comparison, we assume that each Minamide trans-ducer could instead be modeled as a BEK program. Wethen use statistics from a study by Hooimeijer [16] thatmeasured the relative frequency, by static count, of 111distinct PHP string library functions. The Hooimeijerstudy was conducted in December 2009, and covers thetop 100 projects onSourceForge.net, or about 9.6 mil-lion lines of PHP code. The study considered most, butnot all, sanitizers provided by Minamide.

Out of the 111 distinct functions considered in theHooimeijer study,27 were modeled as transducers byMinamide and thus encodable in BEK. In the sam-pled PHP code, these27 functions account for68, 238out of 251, 317 uses, or about 27% of all string-relatedcall sites. By comparison, traditional regular expressionfunctions modeled by tools like Hampi [20] and Rex [35]account for just 29,141 call sites, or about 12%. We notethat BEK could be readily integrated into an automaton-based tool like Rex, however, and our features are largelycomplimentary to those of traditional string constraintsolvers. These results suggest that BEK provides a signif-icant improvement in the “coverage” of real-world codeby string analysis tools.

4.1.2 Language Features

For the remainder of the experiments, we use a smalldataset of ported-to-BEK sanitizers. We now discussthat dataset and the manual conversion effort required.The results are summarized in Figure 8, and described inmore detail below.

Google AutoEscape and OWASP. We converted san-itizers from the OWASP sanitizer library to BEK pro-grams. We also evaluated sanitizers from the GoogleAutoEscape framework to determine what language fea-tures they would need to be expressed in BEK. Thesesanitizers are marked with prefixesGA andOWASP, re-spectively, in Figure 8. We verified that each of thesesanitizers can be implemented in BEK. In several cases,we find additional non–native features that could beadded to BEK to support these sanitizers.

Internet Explorer. In addition, we extracted sanitizersfrom the binary of Internet Explorer 8 that are usedin the IE Cross-Site Scripting Filter feature, denotedIEFilter1 to IEFilter17 in Figure 8. For this study,we analyze the behavior of the IE 8 sanitizers underthe assumption the server performs no sanitization ofits own on user data. Of these21 sanitizers, we couldconvert17 directly into BEK programs. The remaining4sanitizers track a potentially unbounded list of charactersthat are either emitted unaltered or escaped, dependingon the result of a regular expression match. BEK doesnot enable storing strings of input characters.

The manual translation took several hours per sani-tizer. Figure 8 breaks down our BEK programs based on“Native” features of the BEK language, and “Not Native”features which are not currently in the BEK language.Many of these features can be integrated modeled usingtransducers, however, by enhancing the language of con-straints used for symbolic labels. In addition, with theexception of4 Internet Explorer sanitizers, we found thata maximum lookahead window of eight characters wouldsuffice for handling all our sanitizers. Finally, we discov-ered that the arithmetic on characters was limited to rightshifts and linear arithmetic, which can be expressed inthe Z3 solver we use.

We note that all “Not Native” features could be addedto the BEK language with few or no changes to the under-lying SFT algorithms for join composition and equiva-lence checking: only the front end would need to change.

4.1.3 Browser Code

Ideally, we could use BEK to model the parser of an ac-tual web browser. Then, we could use our analyses tocheck whether there exists a string that passes through agiven sanitizer yet causes javascript execution. We per-formed a preliminary exploration of the WebKit browserto determine how difficult it would be to write such

Native Not Nativeboolean multiple mult.

Name vars iters regex lookahead arith. functions

a2bb2a 1 ✗ X ✗ ✗ ✗escapeBrackets 1 X ✗ ✗ ✗ ✗escapeMetaAndLink 1 X X ✗ ✗ ✗escapeString0 1 ✗ ✗ ✗ ✗ ✗escapeString 1 ✗ ✗ ✗ ✗ ✗escapeStringSimple 1 ✗ ✗ ✗ ✗ ✗getFileExtension 2 ✗ ✗ ✗ ✗ ✗GA HtmlEscape 0 ✗ ✗ ✗ ✗ ✗GA PreEscape 0 ✗ ✗ ✗ ✗ ✗GA SnippetEsc 3 ✗ ✗ X ✗ ✗GA CleanseAttrib 1 ✗ ✗ X ✗ ✗GA CleanseCSS 0 ✗ ✗ ✗ ✗ ✗GA CleanseURLEsc 0 ✗ ✗ ✗ ✗ ✗GA ValidateURL 2 X ✗ X X ✗GA XMLEsc 0 ✗ ✗ ✗ ✗ ✗GA JSEsca 0 ✗ ✗ X ✗ ✗GA JSNumber 2 X ✗ X ✗ ✗GA URLQueryEsc 1 X ✗ ✗ X ✗GA JSONESc 0 ✗ ✗ ✗ ✗ ✗GA PrefixLine 0 ✗ ✗ ✗ ✗ ✗OWASPHTMLEncode 0 ✗ ✗ X ✗ ✗IEFilter1 3 ✗ X ✗ ✗ ✗IEFilter2 4 ✗ X ✗ ✗ ✗IEFilter3 5 ✗ X ✗ ✗ ✗IEFilter4 4 ✗ X ✗ ✗ ✗IEFilter5 4 ✗ X ✗ ✗ ✗IEFilter6 5 ✗ X ✗ ✗ ✗IEFilter7 4 ✗ X ✗ ✗ ✗IEFilter8 4 ✗ X ✗ ✗ ✗IEFilter9 5 ✗ X ✗ ✗ ✗IEFilter10 5 ✗ X ✗ ✗ ✗IEFilter11 4 ✗ X ✗ ✗ ✗IEFilter12 4 ✗ X ✗ ✗ ✗IEFilter13 4 ✗ X ✗ ✗ ✗IEFilter14 4 ✗ X ✗ ✗ ✗IEFilter15 1 ✗ X ✗ ✗ ✗IEFilter16 1 ✗ X ✗ ✗ ✗IEFilter17 1 ✗ X ✗ ✗ ✗

Figure 8: Expressiveness: different language featuresused by the original corpus of different programs. Across means that the feature was not used by the pro-gram in its initial implementation. A checkmark meansthe feature was used by the program. boolean variables,multiple iterations over a string, and regular expressionsare native constructs in BEK. Multiple lookahead, arith-metic, and functions are not native to BEK and must beemulated during the translation. We also show the dis-tinct boolean variablesused by the BEK implementation.

a model with BEK. Unfortunately, we found multiplefunctions that require features, such as bounded looka-head and transducer composition, which are not yet sup-ported by the BEK language.

For example, we considered a function in the Safariimplementation of WebKit that performs Javascript de-coding [7]. This function requires at a minimum the useof functions to connect hexadecimal to ASCII, a looka-head of5 characters, function composition, and scan-ning for occurrences of a target character. While asnoted above we believe these features could be addedto BEK without fundamentally changing the underlyingalgorithms for symbolic transducers, the BEK languagedoes not yet support them.

4.2 Checking Algebraic Properties

We argued in Section 2 that idempotence and commuta-tivity are key properties for sanitizers. In addition, theproperty ofreversibility, that from the output of a sani-tizer we can unambiguously recover the input, is impor-tant as an aid to debugging.

4.2.1 Order Independence

We now evaluate whether17 sanitizers used in IE 8 areorder independent. Order independence means that thesanitizers have the same effect no matter in what orderthey are applied. If the order does matter, then the choiceof order can yield surprising results. As an example, inrule-based firewalls, a set of rules that are not order in-dependent may result in a rule never being applied, eventhough the administrator of the firewall believes the ruleis in use.

Each IE 8 sanitizer defines a specificinput set onwhich it will transform strings, which we can computefrom the BEK model. We began by checking all136 pairsof IE 8 sanitizers to determine whether their input setswere disjoint. Only one pair of sanitizers showed a non-trivial intersection in their input sets. A non-trivial in-tersection signals a potential order dependence, becausethe two sanitizers will transform the same strings. Forthis pair, we used BEK to check that the two sanitizersoutput the same language, when restricted to inputs fromtheir intersection. BEK determined that the transforma-tion of the two sanitizers on thesel inputs was exactly thesame — i.e., the two sanitizers were equivalent on theintersection set. We conclude that the IE 8 sanitizers arein fact order independent, up to errors in our extractionof the sanitizers and our assumption that no server-sidemodification is present.

4.2.2 Idempotence and Reversibility

We now examine the idempotence of several BEK pro-grams, including the IE 8 sanitizers. Figure 9 reportsthe results. The number of states in the symbolic finitetransducer created from each BEK program. For eachtransducer, we then report whether it is idempotent andwhether it is reversible. This shows the number of statesacts as a rough guide to the complexity of the sanitizer.For example, we see that IE filter9 out of 17 is quitecomplicated, with25 states.

4.2.3 Commutativity

We investigated commutativity of seven different imple-mentations ofHTMLEncode, a sanitizer commonly usedby web applications. Four implementations were gath-ered from internal sources. Three were created for ourproject specifically by hiring freelance programmers tocreate implementations from popular outsourcing websites. We provided these programmers with a highlevel specification in English that emphasized protection

Name States Idempotent? Reversible?

a2bb2a 1 ✗ X

escapeBrackets 1 X ✗escapeMetaAndLink 1 X X

escapeString0 1 ✗ ✗escapeString 1 ✗ ✗escapeStringSimple 1 ✗ ✗getFileExtension 2 ✗ ✗IEFilter1 6 X ✗IEFilter2 9 X ✗IEFilter3 19 X ✗IEFilter4 13 X ✗IEFilter5 13 X ✗IEFilter6 16 X ✗IEFilter7 13 X ✗IEFilter8 12 X ✗IEFilter9 25 X ✗IEFilter10 18 X ✗IEFilter11 11 X ✗IEFilter12 11 X ✗IEFilter13 14 X ✗IEFilter14 14 X ✗IEFilter15 1 X ✗IEFilter16 1 X ✗IEFilter17 1 X ✗

Figure 9: For each BEK benchmark programs, we reportthe number of states in the corresponding symbolic trans-ducer. We then report whether the transducer is idempo-tent, and whether the transducer is reversible.

HTMLEncode1 X X X ✗ ✗ X ✗HTMLEncode2 X X X ✗ ✗ X ✗HTMLEncode3 X X X ✗ ✗ X ✗HTMLEncode4 ✗ ✗ ✗ X ✗ ✗ ✗Outsourced1 ✗ ✗ ✗ ✗ X ✗ ✗Outsourced2 X X X ✗ ✗ X ✗Outsourced3 ✗ ✗ ✗ ✗ ✗ ✗ X

Figure 10: Commutativity matrix for seven different im-plementations ofHTMLEncode. TheOutsourced imple-mentations were written by freelancers from a high levelEnglish specification.

against cross-site scripting attacks. Figure 10 shows acommutativity matrixfor the HTMLEncode implementa-tions. A X indicates the pair of sanitizers commute,while a✗ indicates they do not. The matrix contains12check marks out of42 total comparisons of distinct sani-tizers, or28.6%. Our implementation took less than oneminute to complete all42 comparisons.

4.3 Differences Between Multiple Implementations

Multiple implementations of the “same” functionality arecommonly available from which to choose when writinga web application. For example, newer versions of a li-brary may update the behavior of a piece of code. Differ-ent organizations may also write independent implemen-tations of the same functionality, guided by performanceimprovements or by different requirements. Given thesedifferent implementations, the first key question is “doall these implementations compute the same function?”Then, if there are differences, the second key question is“how do these implementations differ?”

As described above, because BEK programs corre-spond to single valued symbolic finite state transduc-ers, computing the image of regular languages under the

HTMLEncode1 X X X 0 − X 0

HTMLEncode2 X X X 0 − X 0

HTMLEncode3 X X X 0 − X′

HTMLEncode4 0 0 0 X 0 0 0

Outsourced1 − − − 0 X − 0

Outsourced2 X X X 0 − X 0

Outsourced3 0 0 ′ 0 0 0 X

Figure 11: Equivalence matrix for our implementationsof HTMLEncode. A X indicates the implementations areequivalent. For implementations that are not equivalent,we show an example character that exhibits different be-havior in the two implementations. The symbol0 refersto the null character.

function defined by a BEK program is decidable. By tak-ing the image ofΣ∗ under two different BEK programs,we can determine whether they output thesame set of strings.

We checked equivalence of seven different implemen-tations in C# (as explained above) of theHTMLEncodesanitization function. We translated all seven implemen-tations to BEK programs by hand. First, we discoveredthat all seven implementations had only one state whentransformed to a symbolic finite transducer. We thenfound that all seven are neither reversible nor idempotent.For example, the ampersand character& is expanded to& by all seven implementations. This in turn con-tains an ampersand that will be re-expanded on futureapplications of the sanitizer, violating idempotence.

For each BEK program, we checked whether it wasequivalent to the otherHTMLEncode implementations.Figure 11 shows the results. For cases where thetwo implementations are not equivalent, BEK deriveda counterexample string that is treated differently bythe two implementations. For example, we discov-ered thatOutsourced1 escapes the− character, whileOutsourced2 does not. We also found that one of theHTMLEncode implementations does not encode the sin-gle quote character. Because the single quote charac-ter can close HTML contexts, failure to encode it couldcause unexpected behavior for a web developer who usesthis implementation. For example, a recent attack on theGoogle Analytics dashboard was enabled by failure tosanitize a single quote [33].

This case study shows the benefit of automatic analy-sis of string manipulating functions to check equivalence.Without BEK, obtaining this information using manualinspection would be difficult, error prone, and time con-suming. With BEK, we spent roughly3 days total trans-lating from C# to BEK programs. Then BEK was ableto compute the contents of Figure 11 in less than oneminute, including all equivalenceand containment checks.

HTML AttributeImplementation context context

HTMLEncode1 100% 93.5%HTMLEncode2 100% 93.5%HTMLEncode3 100% 93.5%HTMLEncode4 100% 100%Outsourced1 100% 93.5%Outsourced2 100% 93.5%Outsourced3 100% 93.5%

Figure 12: Percentage of XSS Cheat Sheet strings, inboth HTML tag context and tag attribute contexts, thatare ruled out by each implementation ofHTMLEncode.

4.4 Checking Filters Against The Cheat Sheet

The Cross-Site Scripting Cheat Sheet (“XSS CheatSheet”) is a regularly updated set of strings that triggerJavaScript execution on commonly used web browsers.These strings are specially crafted to cause popular webbrowsers to execute JavaScript, while evading commonsanitization functions. Once we have translated a sani-tizer to a program in BEK, because BEK uses symbolicfinite state transducers, we can take a “target” string anddetermine whether there exists a string that when fed tothe sanitizer results in the target. In other words, wecan check whether a string on the Cheat Sheet has apre-imageunder the function defined by a BEK program.

We sampled28 strings from the Cheat Sheet. TheCheat Sheet shows snippets of HTML, but in practice asanitizer might be run only on a substring of the snip-pet. We focused on the case where a sanitizer is runon the HTML Attribute field, extracting sub-strings fromthe Cheat Sheet examples that correspond to the attributeparsing context. WhileHTMLEncode should not be usedfor sanitizing data that will become part of a URL at-tribute, in practice programmers may accidentally useHTMLEncode in this “incorrect” context. We also addedsome strings specifically to check the handling of HTMLattribute parsing by our sanitizers. As a result, we ob-tained two sets of attack strings: HTML and Attribute.

For each of our implementations, for all strings ineach set, we then asked BEK whether pre-images of thatstring exist. Figure 12 shows what percentage of stringshave no pre-image under each implementation. All sevenimplementations correctly escape angle brackets, so nostring in the HTML set has a pre-image under any of thesanitizers. In the case of the Attribute strings, however,we found that some of the implementations do not escapethe string“&#”, potentially yielding an attack. Only oneof our implementations ofHTMLEncode made it impos-sible for all of the strings in the Attribute set from ap-pearing in its output. Each set of strings took between36and39 seconds for BEK to check the entire set of stringsagainst a sanitizer.

Figure 13: Self-equivalence experiment.

4.5 Scalability of Equivalence Checking

Our theoretical analysis suggests that the speed ofqueries to BEK should scale quadratically in the numberof states of the symbolic finite transducer. All sanitiz-ers we have found in “the wild,” however, have a smallnumber of states. While this makes answering queriesabout the sanitizers fast, it does not shed light on the em-pirical performance of BEK as the number of states in-creases. To address this, we performed two experimentswith synthetically generated symbolic finite transducers.These transducers were specially created to exhibit someof the structure observed in real sanitizers, yet have manymore states than observed inpractical sanitizer implementations.

Self-equivalence experiment. We generated symbolicfinite transducersA from randomly generated BEK pro-grams having structure similar to typical sanitizers. Thetime to check equivalence ofA with itself is shown inFigure 13 where the size is the number of states plusthe number of transitions inA. Although the worst casecomplexity is quadratic, the actual observed complexity,for a sample size of 1,000, is linear.

Commutativity experiment. We generated symbolicfinite transducers from randomly generated BEK pro-grams having structure similar to typical santizers. Foreach symbolic finite transducerA, we checked commu-tativity with a small BEK programUpToLastDotthat re-turns a string up to the last dot character. The time todetermine thatA ◦ UpToLastDotandUpToLastDot◦ Aareequivalentis shown in Figure 14 where the size is thetotal number of states plus the number of transitions inA. The time to check non-equivalence was in most casesonly a few milliseconds, thus all experiments exclude thedata where the result isnot equivalent, and only includecases where the result isequivalent. Although the worstcase complexity is quadratic, the actual observed com-plexity, over a sample size of 1,000individual cases, was near-linear.

Figure 14: Commutativity experiment.

4.6 From BEK to Other Languages

We have built compilers from BEK programs to com-monly used languages. When the time comes for deploy-ment, the developer can compile to the language of herchoice for inclusion into an application.

// orginal Bek program

program test0(t);

string s;

s := iter(c in t)

{b := false;} {

case ((c == ’a’)): i

b := !(b) && b;

b := b || b;

b := !(b);

yield (c);

case (true) :

yield (’$’);

};

//

// JavaScript translation

//

function test0(t) {

var s = function ($){

var result = new Array();

for(i=0;i<$.length; i++){

var c = $[i];

if ((c == String.fromCharCode(97))) {

b = (!(b) && b);

b = (b || b);

b = !(b);

result.push(c);

}

if (t) {

result.push(String.fromCharCode(36));

}

};

return result.join(’’);

}

return s(t);

}

Figure 15: A small example BEK program (top) and itscompiled version in JavaScript (bottom). Note the use ofresult.push instead of explicit array assignment.

Figure 15 shows a small example of a BEK programand the result of its JavaScript compilation. As part ofthe compilation, we have taken advantage of our knowl-edge of properties of JavaScript to improve the speed ofthe compiled code. For example, we push characters intoarrays instead of creating new string objects. The resultis standard JavaScript code that can be easily included inany web application. By adding additional compilers forcommon languages, such as C#, we can give a developermultiple implementations of a sanitizer that are guaran-teed to be equivalent for use in different contexts.

5 Related Work

SANER combines dynamic and static analysis to validatesanitization functions in web applications [9]. SANERcreates finite state transducers for an over-approximationof the strings accepted by the sanitizer using static anal-ysis of existing PHP code. In contrast, our work focuseson a simple language that is expressive enough to captureexisting sanitizers or write new ones by hand, but thencompile to symbolic finite state transducers that preciselycapture the sanitization function. SANER also treats theissue of inputs that may be tainted by an adversary, whichis not in scope for our work. Our work also focuses on ef-ficient ways to compose sanitizers and combine the the-ory of finite state transducers with SMT solvers, whichis not treated by SANER.

Minamide constructs a string analyzer for PHP code,then uses this string analyzer to obtain context free gram-mars that are over-approximations of the HTML outputby a server [26]. He shows how these grammars canbe used to find pages with invalid HTML. The methodproposed in [21] can also be applied to string analysisby modeling regular string analysis problems ashigher-order multi-parameter tree transducers(HMTTs) wherestrings are represented as linear trees. While HMTTs al-low encodings of finite transducers, arbitrary backgroundcharacter theories are not directly expressibly in order toencode SFTs. Our work treats issues of composition andstate explosion for finite state transducers by leveragingrecent progress in SMT solvers, which aids us in reason-ing precisely about the transducers created by transfor-mation of BEK programs and by avoiding state space ex-plosion and bitblasting for large character domains suchas Unicode. Moreover, SMT solvers provide a methodof extracting concrete counterexamples.

Wasserman and Su also perform static analysis ofPHP code to construct a grammar capturing an over-approximation of string values. Their application is toSQL injection attacks, while our framework allows us toask questions about any sanitizer [36]. Follow-on workcombines this work with dynamic test input generation tofind attacks on full PHP web applications [37]. Dynamicanalysis of PHP code, using a combination of symbolicand concrete execution techniques, is implemented in theApollo tool [8]. The work in [39] describes a layered

static analysis algorithm for detecting security vulnera-bilities in PHP code that is also enable to handle somedynamic features. In contrast, our focus is specificallyon sanitizers instead of on full applications; we empha-size analysis precision over scaling to large code bases.

Christensenet al.’s Java String Analyzer is a staticanalysis package for deriving finite automata that charac-terize an over-approximationof possible values for stringvariables in Java [13]. The focus of their work is on an-alyzing legacy Java code and on speed of analysis. Incontrast, we focus on precision of the analysis and onconstructing a specific language to capture sanitizers, aswell as on the integration with SMT solvers.

Our work is complementary to previous efforts in ex-tending SMT solvers to understand the theory of strings.HAMPI [20] and Kaluza [31] extend the STP solver tohandle equations over strings and equations with mul-tiple variables. Rex extends the Z3 solver to handleregular expression constraints [35], while Hooimeijeretal.show how to solve subset constraints on regular lan-guages [17]. We in contrast show how to combine anyof these solvers with finite transducers whose edges cantake symbolic values in any of the theoriessupported by the solver.

The work in [28] introduces the first symbolic ex-tension of finite state transducers called apredicate-augmented finite state transducer(pfst). A pfst has two

kinds of transitions: 1)pϕ/ψ−→ q whereϕ andψ are char-

acter predicates orε, or 2) pc/c−→ q. In the first case

the symbolic transition corresponds to all concrete tran-

sitionspa/b−→ q such thatϕ(a) andψ(b) are true, the

second case corresponds toidentity transitionspa/a−→ q

for all charactersa. A pfst is not expressive enough fordescribing an SFT. Besides identities, it is not possibleto establish functional dependencies from input to out-put that are needed for example to encode sanitizers suchasEncodeHtml.

A recent symbolic extension of finite transducers isstreaming transducers[6]. While the theoretical expres-siveness of the language introduced in [6] exceeds thatof BEK, streaming transducers are restricted to charac-ter theories that are total orders with no other operations.Also, composition of streaming transducers requires anexplicit treatment of characters. It is an interesting futureresearch topic to investigate if there is an extension ofSFTs or a restriction of streaming transducers that allowsefficient symbolic analysis techniques to be applied.

6 Conclusions

Much prior work in XSS prevention assumes the correct-ness of sanitization functions. However, practical expe-rience shows writing correct sanitizers is far from triv-ial. This paper presents BEK, a language and a compilerfor writing, analyzing string manipulation routines, andconverting them to general-purpose languages. Our lan-

guage is expressive enough to capture real web sanitizersused in ASP.NET, the Internet Explorer XSS Filter, andthe Google AutoEscape framework, which we demon-strate by porting these sanitizers to BEK.

We have shown how the analyses supported by ourtool can find security-critical bugs or check that suchbugs do not exist. To improve the end-user experiencewhen a bug is found, BEK produces a counter-example.We discover that only 28.6% of our sanitizers commute,∼79.1% are idempotent, and only 8% are reversibe. Wealso demonstrate that most hand-writtenHTMLEncodeimplementations disagree on at least some inputs. Un-like previously published techniques, BEK deals equallywell with Unicode strings without creating a state ex-plosion. Furthermore, we show that our algorithms forequivalence checking and composition computation areextremely fast in practice, scaling near-linearly with thesize of the symbolic finite transducer representation.

References

[1] About Safari 4.1 for Tiger. http://support.apple.com/kb/DL1045.

[2] Internet Explorer 8: Features.http://www.microsoft.com/windows/internet-explorer/features/safer.aspx.

[3] NoXSS Mozilla Firefox Extension. http://www.noxss.org/.

[4] OWASP: ESAPI project page. http://code.google.com/p/owasp-esapi-java/.

[5] XSS (Cross Site Scripting) Cheat Sheet.http://ha.ckers.org/xss.html.

[6] R. Alur and P. Cerny. Streaming transducers for algorithmicverification of single-pass list-processing programs. InProceed-ings of the Symposium on Princples of Programming Languages,pages 599–610, 2011.

[7] Apple. Jsdecode implementation, 2011.http://trac.webkit.org/browser/releases/Apple/Safari%205.0/

JavaScriptCore/runtime/JSGlobalObjectFunctions.

cpp.

[8] S. Artzi, A. Kiezun, J. Dolby, F. Tip, D. Dig, A. Paradkar, andM. D. Ernst. Finding bugs in Web applications using dynamictest generation and explicit-state model checking.Transactionson Software Engineering, 99:474–494, 2010.

[9] D. Balzarotti, M. Cova, V. Felmetsger, N. Jovanovic, E. Kirda,C. Kruegel, and G. Vigna. SANER: Composing static and dy-namic analysis to validate sanitization in Web applications. InProceedings of the Symposium on Security and Privacy, 2008.

[10] D. Bates, A. Barth, and C. Jackson. Regular expressionscon-sidered harmful in client-side XSS filters. InProceedings of theConference on the World Wide Web, pages 91–100, 2010.

[11] N. Bjørner, N. Tillmann, and A. Voronkov. Path feasibility analy-sis for string-manipulating programs. InProceedings of the Inter-national Conference on Tools And Algorithms For The Construc-tion And Analysis Of Systems, 2009.

[12] C. Y. Cho, D. Babic, E. C. R. Shin, and D. Song. Inferenceandanalysis of formal models of botnet command and control proto-cols. InProceedings of the Conference on Computer and Com-munications Security, pages 426–439, 2010.

[13] A. S. Christensen, A. Møller, and M. I. Schwartzbach. PreciseAnalysis of String Expressions. InProceedings of the Static Anal-ysis Symposium, 2003.

[14] L. de Moura and N. Bjørner. Z3: An Efficient SMT Solver. InProceedings of the International Conference on Tools And Algo-rithms For The Construction And Analysis Of Systems, 2008.

[15] A. J. Demers, C. Keleman, and B. Reusch. On some decidableproperties of finite state translations.Acta Informatica, 17:349–364, 1982.

[16] P. Hooimeijer. Decision procedures for string constraints. Ph.D.Dissertation Proposal, University of Virginia, April 2010.

[17] P. Hooimeijer and W. Weimer. A decision procedure for subsetconstraints over regular languages. InProceedings of the Con-ference on Programming Language Design and Implementation,pages 188–198, 2009.

[18] P. Hooimeijer and W. Weimer. Solving string constraints lazily. InProceedings of the International Conference on Automated Soft-ware Engineering, 2010.

[19] N. Jovanovic, C. Kruegel, and E. Kirda. Pixy: a static analysistool for detecting Web application vulnerabilities (shortpaper).In Proceedings of the Symposium on Security and Privacy, May2006.

[20] A. Kiezun, V. Ganesh, P. J. Guo, P. Hooimeijer, and M. D. Ernst.HAMPI: a solver for string constraints. InProceedings of theInternational Symposium on Software Testing and Analysis, 2009.

[21] N. Kobayashi, N. Tabuchi, and H. Unno. Higher-order multi-parameter tree transducers and recursion schemes for programverification. InProceedings of the Symposium on Principles ofProgramming Languages, pages 495–508, 2010.

[22] D. Lindsay and E. V. Nava. Universal XSS via IE8’s XSS filters.In Black Hat Europe, 2010.

[23] B. Livshits and M. S. Lam. Finding security errors in Java pro-grams with static analysis. InProceedings of the Usenix SecuritySymposium, pages 271–286, Aug. 2005.

[24] B. Livshits, A. V. Nori, S. K. Rajamani, and A. Banerjee.Merlin:Specification inference for explicit information flow problems. InProceedings of the Conference on Programming Language De-sign and Implementation, June 2009.

[25] M. Martin, B. Livshits, and M. S. Lam. SecuriFly: Runtimevulnerability protection for Web applications. Technicalreport,Stanford University, Oct. 2006.

[26] Y. Minamide. Static approximation of dynamically generatedweb pages. InProceedings of the International Conference onthe World Wide Web, pages 432–441, 2005.

[27] A. Nguyen-Tuong, S. Guarnieri, D. Greene, J. Shirley, andD. Evans. Automatically hardening Web applications using pre-cise tainting. InProceedings of the IFIP International Informa-tion Security Conference, June 2005.

[28] G. V. Noord and D. Gerdemann. Finite state transducers withpredicates and identities.Grammars, 4:2001, 2001.

[29] G. Rozenberg and A. Salomaa, editors.Handbook of Formal Lan-guages, volume 1. Springer, 1997.

[30] P. Saxena, D. Akhawe, S. Hanna, F. Mao, S. McCamant, andD. Song. A symbolic execution framework for JavaScript. Tech-nical Report UCB/EECS-2010-26, EECS Department, Universityof California, Berkeley, Mar 2010.

[31] P. Saxena, D. Akhawe, S. Hanna, S. McCamant, F. Mao, andD. Song. A symbolic execution framework for JavaScript. InProceedings of the IEEE Symposium on Security and Privacy,2010.

[32] P. Saxena, D. Molnar, and B. Livshits. ScriptGard: Prevent-ing script injection attacks in legacy Web applications with auto-matic sanitization. Technical Report MSR-TR-2010-128, Micro-soft Research, Sept. 2010.

[33] B. Schmidt. Google analytics XSS vulnerability,2011. http://spareclockcycles.org/2011/02/03/

google-analytics-xss-vulnerability/.

[34] M. Veanes, N. Bjørner, and L. de Moura. Symbolic automata

constraint solving. In C. Fermuller and A. Voronkov, editors,LPAR-17, volume 6397 ofLNCS, pages 640–654. Springer, 2010.

[35] M. Veanes, P. de Halleux, and N. Tillmann. Rex: SymbolicRegu-lar Expression Explorer. InProceedings of the International Con-ference on Software Testing, Verification and Validation, 2010.

[36] G. Wassermann and Z. Su. Sound and precise analysis of Webapplications for injection vulnerabilities. InProceedings of theConference on Programming Language Design and Implementa-tion, 2007.

[37] G. Wassermann, D. Yu, A. Chander, D. Dhurjati, H. Inamura, andZ. Su. Dynamic test input generation for Web applications. In

Proceedings of the International Symposium on Software Testingand Analysis, 2008.

[38] J. Williams. Personal communications, 2005.

[39] Y. Xie and A. Aiken. Static detection of security vulnerabilitiesin scripting languages. InProceedings of the Usenix SecuritySymposium, pages 179–192, 2006.

[40] L. Yuan, J. Mai, Z. Su, H. Chen, C.-N. Chuah, and P. Mohapa-tra. Fireman: A toolkit for firewall modeling and analysis. InProceedings of the Symposium on Security and Privacy, pages199–213, 2006.

Fast and Precise Sanitizer Analysis with BEK · 2018-01-04 · ing to spend a little more time on...

Documents

Transcript of Fast and Precise Sanitizer Analysis with BEK · 2018-01-04 · ing to spend a little more time on...