A streaming full regular expression parser · 2020-05-02 · A streaming full regular expression...

A streaming full regular expression parser

Master Thesis

Computer Science, University of Copenhagen

Supervisors: Fritz Henglein and Lasse Nielsen

9th of June 2011

Line Bie Pedersen <[email protected]>

Abstract

Regular expressions are a popular field. They have seen much research over the years, and many people use regular expressions as part of their daily routine. The uses are widely varied and range from the programmer doing search and replace operations on source code to the biologist looking for common patterns in amino acids. This means there is a rich supply of regular expression engine implementations; some are general purpose and some are geared towards a specific purpose.

In this thesis we will present a design and a prototype of a regular expression engine. It is able to match and extract the values of captured groups. The design splits the process into several components. Our components are streaming and use constant memory for a fixed regular expression, with the exception of one non-streaming component. We also evaluate the results and compare our regular expression engine with other regular expression engine implementations.

Resume

Regular expressions are a popular field. Over the years much research has been done on the subject, and many people use them as part of their daily routine. The areas of use are manifold, from the programmer performing search and replace operations on source code to the biologist looking for patterns in amino acids. All this means that there is a rich selection of different regular expression implementations; some are general purpose and others are better suited to particular purposes.

In this thesis we will present a design and a prototype of an interpreter for regular expressions. It is able to recognize text strings and extract the values of groups. The design splits the workload into several individual components. Our components are “streaming” and use constant memory for a fixed regular expression, with the exception of a single component. We will also evaluate the results obtained and compare our prototype with other existing implementations.


CONTENTS

Contents

1 Introduction
  1.1 Motivation
  1.2 Definitions, conventions and notation
  1.3 Objectives and limitations
    1.3.1 Limitations
    1.3.2 Implementation Choices
  1.4 Summary of contributions
  1.5 Thesis overview

2 Regular expressions and finite automatons
  2.1 Regular expressions
  2.2 Extensions to the regular expressions
  2.3 Finite automatons
  2.4 Regular expression to NFA
    2.4.1 Thompson
  2.5 Matching
  2.6 Summary

3 Designing a memory efficient regular expression engine
  3.1 Architecture
    3.1.1 Constructing an NFA
    3.1.2 Dube and Feeley
    3.1.3 Bit-values and mixed bit-values
    3.1.4 Splitting up the workload
    3.1.5 Solutions
  3.2 Protocol specification
  3.3 Filters
    3.3.1 The ’match’ filter
    3.3.2 The ’trace’ filter
    3.3.3 The ’groupings’ filter

4 Implementing a regular expression engine
  4.1 Regular expression to NFA
    4.1.1 Character classes
  4.2 The simulator
  4.3 Filters
    4.3.1 Groupings
    4.3.2 Trace
    4.3.3 Serialize

5 Optimizations
  5.1 Finding out where
    5.1.1 Memory usage
    5.1.2 Output sizes
    5.1.3 Runtimes
    5.1.4 Profiling
  5.2 Applying the knowledge gained
    5.2.1 ε-lookahead
    5.2.2 Improved protocol encoding
    5.2.3 Buffering input and output
    5.2.4 Channel management in trace

6 Analysis of algorithms
  6.1 Constructing the NFA
  6.2 Simulating the NFA
  6.3 The ’match’ filter
  6.4 The ’groupings’ filter
  6.5 The ’trace’ filter
  6.6 The ’serialize’ filter

7 Evaluation
  7.1 A backtracking worst-case
  7.2 A DFA worst-case
  7.3 Extracting an email-address
  7.4 Extracting a number
  7.5 Large files
  7.6 Correctness
  7.7 Conclusion

8 Related work
  8.1 Constructing NFAs
  8.2 Simulating NFAs
    8.2.1 Frisch and Cardelli
    8.2.2 Backtracking
  8.3 Virtual machine

9 Future work
  9.1 Extending the current regular expression feature-set
  9.2 Internationalization
  9.3 More and better filters
  9.4 Concurrency

10 Conclusion

A Test computer specifications

B Huffman trees

C Experiments

D Optimization scripts

E Benchmark scripts

F Source code

LIST OF FIGURES

List of Figures

1 Fragment accepting a single character a
2 Fragment accepting the empty string
3 Alternation R|S
4 Concatenation RS
5 Repetition R*
6 Individual fragments when converting a*b|c to NFA
7 NFA for the regular expression a*b|aab
8 Architecture outline
9 Automaton with bitvalues for regular expression a*
10 Capturing under alternation
11 A simple character class-transition example
12 ε-cycles
13 Output of gprof running main
14 Output of gprof running groupings all
15 Output of gprof running trace
16 Output of gprof running serialize
17 A backtracking worst-case: Main, Tcl and RE2 runtimes.
18 A backtracking worst-case: Perl runtime on a logarithmic scale.
19 A backtracking worst-case: Total, Tcl and RE2 memory usage.
20 A backtracking worst-case: Perl memory usage.
21 A DFA worst-case: Main, RE2 and Perl runtimes.
22 A DFA worst-case: Tcl runtimes.
23 A DFA worst-case: Main and RE2 memory usage.
24 A DFA worst-case: Perl memory usage.
25 A DFA worst-case: Tcl memory usage.
26 Extracting an email-address: Runtimes.
27 Extracting an email-address: Memory usage.
28 Extracting an email-address: Individual programs memory usage.
29 Extracting an email-address: Sizes of output from individual programs.
30 Extracting an email-address: The relationship between input size to trace and size of input string.
31 Extracting a number: Runtimes.
32 Extracting a number: Memory usage.
33 Extracting a number: Individual programs memory usage.
34 Extracting a number: Sizes of output from individual programs.
35 Extracting a number: The relationship between input size to trace and size of input string.
36 NFA for a*
37 Huffman tree for frequencies in table 4, .*
38 Huffman tree for frequencies in table 4, (?:(?:(?:[a-zA-Z]+?)+[,.;:] ?)*..)*

LIST OF TABLES

List of Tables

1 Peak memory usage
2 Sizes of output
3 Runtimes
4 Frequencies of operators in a mixed bit-value string
5 Huffman encoding
6 Runtimes for different buffer sizes
7 Runtimes for different buffer sizes using non-thread-safe functions
8 Analysis of re2nfa (part 1)
9 Analysis of re2nfa (part 2)
10 Analysis of re2nfa (part 3)
11 Analysis of match
12 Analysis of the ’groupings’ filter
13 Analysis of the ’trace’ filter
14 Analysis of the ’serialize’ filter
15 Code sequences

LIST OF DEFINITIONS

List of definitions

1 Definition (Regular language)
2 Definition (Regular expression)
3 Definition (The groupings filter rewriting function)

LIST OF EXAMPLES

List of examples

1 Example (Regular expression)
2 Example (Extensions of regular expressions)
3 Example (Converting a regular expression to a NFA)
4 Example (Matching with a NFA)
5 Example (Protocol)
6 Example (The ’match’ filter)
7 Example (The ’trace’ filter)
8 Example (Simple groupings filter)
9 Example (Capturing under alternation)
9 Example (continuing from p. 23)
10 Example (The groupings filter rewriting function)
11 Example (Backtracking)
12 Example (Virtual machine)

Acknowledgements

We would like to thank all those that have been involved in this great adventure. A few people deserve a special mention:

First of all we would like to thank Fritz Henglein and Lasse Nielsen, who have been thesis advisers on this project.

Jan Wiberg deserves a mention for tireless support and proofreading.

We would also like to thank Carl Christoffer Hall-Frederiksen for general support and believing in us.


1 Introduction

Regular expressions are an important tool for matching strings of text. In many text editors and programming languages they provide a concise and flexible way of searching and manipulating text. They find their use in areas like data mining, spam detection, deep packet inspection and the analysis of protein sequences, see [12].

This master's thesis presents a design and an implementation of a regular expression engine with a focus on memory consumption. Also included is an evaluation of the work and a discussion of possible future extensions.

1.1 Motivation

Regular expressions are a popular area in computer science and have seen much research. They are used extensively both in academia and in business. Many programming languages offer regular expressions in some form, either as an embedded feature or as a stand-alone library. There are many different flavors of regular expressions and implementations, each adapted to some purpose.

Challenges and desired outcome. Many of the existing solutions give no guarantees on their memory consumption. In this project we will focus on a streaming solution; that is, we will, where possible, use a constant amount of memory for a fixed regular expression. We will build a general framework for this purpose. In addition, we wish to isolate the steps taken in matching a regular expression with a string. We do this because we can then plug in exactly the steps needed for a particular operation and leave out the rest. Another reason is that it then becomes possible to isolate the trouble spots where optimization is most needed.

1.2 Definitions, conventions and notation

The empty string is denoted as ε. Σ is used to denote the alphabet, or set of symbols, used to write a string or a regular expression.

Automatons are represented as graphs, where states are nodes and transitions are edges. The start state has an arrow starting nowhere pointing to it. The accepting state is marked with double circles. Edges have an attached string, indicating on which input symbol this particular transition is allowed.

Regular expressions will be written in sans serif font: a|b and strings will be written slanted: The cake is a lie.


1.3 Objectives and limitations

The objectives of this thesis are to extend existing theory and to design, implement and evaluate a prototype. We will be extending theory by Dube and Feeley [6] and Henglein and Nielsen [8]. The extended theory will be used in designing a streaming regular expression engine. The design will be implemented in a prototype, and finally we will evaluate it and compare it with existing solutions.

We aim to address these topics in this thesis:

• Extend existing theory by Dube and Feeley [6] and Henglein and Nielsen [8].

• Create a prototype implementation

• Compare the prototype with existing solutions and evaluate our own results

• Conclude, and propose extensions and improvements on the work

1.3.1 Limitations

The focus is on designing and implementing a streaming regular expression engine. There are many general purpose features and optimizations that can be considered necessary in a full-fledged regular expression engine that we only consider peripherally here. In situations where we are faced with a choice, we have generally favored simplicity and robustness. In several cases, we only theoretically discuss alternative solutions and do not provide a prototype.

1.3.2 Implementation Choices

Some choices were made early in the planning phase. This includes the choice of programming language: C. There are several good reasons for choosing C: we have previous experience developing in this language, it has a low memory and run time overhead, and its libraries are well documented and tested. The obvious drawback of using C is, as always, that it is a primitive language, which makes developing new programs time consuming compared to more high-level languages.

Similarly, we chose to develop the program on a Linux platform for two reasons: first, it is a natively supported platform for many other regular expression libraries, which makes comparisons simpler; second, for practical purposes, it is a free and feature-rich platform that the authors were already familiar with.

Another choice made early was to build on the work of Dube and Feeley and Henglein and Nielsen, especially for the mixed bit-values concept. This will be treated in more detail later.


1.4 Summary of contributions

The main contribution of this thesis work is a streaming regular expression engine based on Dube and Feeley [6] and Henglein and Nielsen [8]. We present, implement and evaluate a working prototype that demonstrates that our solution is both technically viable and in many cases preferable from a resource-consumption standpoint compared to existing industry solutions.

1.5 Thesis overview

Section 2 gives an introduction to regular expressions and finite automatons. In section 3 we describe the architecture of our implementation. Section 4 has the implementation specific details. In section 5 we describe the behavior of the implemented prototype, suggest some optimizations and describe how we implemented some of them. In section 6 we have the complexity analysis. Section 7 compares our implementation to existing implementations. In section 8 we have the related work. Section 9 describes future work, improvements to the design and the implementation of the prototype. Lastly we have the conclusion on our theoretical and practical work in section 10.


2 Regular expressions and finite automatons


A regular language is a possibly infinite set of finite sequences of symbols from a finite alphabet. It is a formal language that must fulfill a number of properties. We provide this formal definition of regular languages:

Definition 1 (Regular language). The regular languages over the alphabet Σ are defined recursively as:

• The empty language ∅.

• The empty string language {ε}.

• The singleton language {a}, for any symbol a ∈ Σ.

• If Lr and Ls are both regular languages then the union Lr ∪ Ls is also a regular language.

• If Lr and Ls are both regular languages then the concatenation Lr • Ls is also a regular language.

• If L is a regular language then the Kleene star¹ L∗ is also a regular language.

∗
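As an illustration only (it is not part of the prototype, which is written in C), the closure rules of definition 1 can be tried out on small languages represented as Python sets of strings. The function names and the `max_len` cut-off are assumptions of this sketch; the cut-off is needed because the real Kleene star is infinite as soon as the language contains a non-empty string.

```python
from itertools import product

def union(lr, ls):
    """Union of two languages, each represented as a set of strings."""
    return lr | ls

def concat(lr, ls):
    """Concatenation: every string of lr followed by every string of ls."""
    return {r + s for r, s in product(lr, ls)}

def star(l, max_len):
    """Kleene star, truncated to strings of at most max_len symbols."""
    result = {""}            # zero repetitions give the empty string
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more element of l.
        frontier = {w + s for w in frontier for s in l
                    if len(w + s) <= max_len} - result
        result |= frontier
    return result

EMPTY = set()                # the empty language
EPSILON = {""}               # the empty string language
a, b = {"a"}, {"b"}          # two singleton languages

# The language of a*b, restricted to at most two repetitions of a.
print(sorted(concat(star(a, 2), b)))   # ['aab', 'ab', 'b']
```

Note that concatenating with the empty language yields the empty language, whereas concatenating with the empty string language leaves a language unchanged.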

2.1 Regular expressions

Regular expressions are written in a formal language consisting of two types of characters: meta characters and literal characters. The meta characters have special meaning and are interpreted by a regular expression engine. Some of the basic meta characters include parentheses, the alternation operator and the Kleene star. Parentheses provide grouping, alternation allows the choice between different text strings and the Kleene star repeats. The literal characters have no special meaning; they simply match literally. Regular expressions are used to describe regular languages.

A formal definition of regular expressions:

Definition 2 (Regular expression). A regular expression over an alphabet Σ can be defined as follows:

• The empty string, ε, and any character from the alphabet Σ are regular expressions

• If r1 and r2 are regular expressions, then the concatenation r1r2 is also a regular expression

• If r1 and r2 are regular expressions, then the alternation r1|r2 is also a regular expression

• If r is a regular expression, then so is the repetition r∗

Any expression is a regular expression if it follows from a finite number of applications of the above rules. ∗

¹ We use the conventional definition of the Kleene star: a unary operator meaning “zero or more”.

The precedence of the operators is, from highest to lowest: repetition, concatenation and alternation. Concatenation and alternation are both left-associative.
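The precedence and associativity rules can be made concrete with a small recursive descent parser. The following Python sketch is illustrative only and is not the parser of the prototype; the nested-tuple tree format (`"lit"`, `"eps"`, `"cat"`, `"alt"`, `"star"`) is an assumption made for this example. Each precedence level gets its own function, with the lowest level (alternation) at the top.

```python
def parse(pattern):
    """Parse a core regular expression into a nested-tuple syntax tree."""
    pos = 0

    def peek():
        return pattern[pos] if pos < len(pattern) else None

    def alternation():                 # lowest precedence
        nonlocal pos
        node = concatenation()
        while peek() == "|":
            pos += 1
            node = ("alt", node, concatenation())    # left-associative
        return node

    def concatenation():
        node = None
        while peek() is not None and peek() not in "|)":
            right = repetition()
            node = right if node is None else ("cat", node, right)
        return node if node is not None else ("eps",)   # empty option

    def repetition():                  # highest precedence
        nonlocal pos
        node = atom()
        while peek() == "*":
            pos += 1
            node = ("star", node)
        return node

    def atom():
        nonlocal pos
        ch = peek()
        if ch == "(":
            pos += 1
            node = alternation()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if ch is None or ch == "*":
            raise ValueError("unexpected %r at position %d" % (ch, pos))
        pos += 1
        return ("lit", ch)

    tree = alternation()
    if pos != len(pattern):
        raise ValueError("unexpected %r at position %d" % (pattern[pos], pos))
    return tree

print(parse("a*b|c"))
```

For a*b|c the star binds to a alone, the concatenation groups a* with b, and the alternation is applied last, so the root of the tree is an `"alt"` node.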

Example 1 (Regular expression). Here we have a somewhat complicated example of a regular expression that demonstrates the basic operators. Consider the sentence:

This book was written using 100% recycled words.²

Other writings such as papers and novels also use words. If we want to catch sentences referring to these writings as well, we can use the regular expression (book|paper|novel).

To match the number 100 in the sentence, we could use the regular expression 100. In most cases, however, we will not know beforehand how many words are recycled, so we may want to use the regular expression (0|1|2|3|4|5|6|7|8|9)*, which will match any natural number. With this in mind we can write a regular expression to match the desired sentences:

This (book|paper|novel) was written using (0|1|2|3|4|5|6|7|8|9)*% recycled words.

∗
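The expression from example 1 can be tried out with any existing engine. The sketch below uses Python's `re` module purely as a convenient stand-in; it is a backtracking engine and unrelated to the design presented in this thesis. The final full stop is escaped because `.` is a wild card in Python's syntax.

```python
import re

digit = "(0|1|2|3|4|5|6|7|8|9)"
pattern = re.compile(
    r"This (book|paper|novel) was written using " + digit + r"*% recycled words\."
)

for sentence in [
    "This book was written using 100% recycled words.",
    "This novel was written using 42% recycled words.",
    "This poem was written using 100% recycled words.",
    "This paper was written using % recycled words.",
]:
    print(bool(pattern.fullmatch(sentence)), sentence)
```

The last sentence is accepted as well: since the Kleene star means zero or more, the digit expression also matches the empty string.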

2.2 Extensions to the regular expressions

Many tools extend the regular expressions presented in the previous section. A typical extension is new notation that makes it easier to specify patterns. In this section we present the extensions we have made to definition 2: additional quantifiers, character classes, a quoting symbol, a wild card and non-capturing parentheses.

• The quantifier + causes the regular expression r to be matched one or more times. This can also be written as rr∗

• The quantifier ? causes the regular expression r to be matched zero or one times. This can also be written as ε|r

² Terry Pratchett, Wyrd Sisters


• A character class is delimited by [] and matches exactly one character in the input string. Special characters lose their meaning inside a character class; ∗, +, ?, (, ) and so on are treated as literals.

Characters can be listed individually, e.g. [abc], or they can be listed as ranges with the range operator -, e.g. [a-z]. These can be rewritten in terms of our original regular expressions: a|b|c and a|b|c|...|x|y|z respectively.

To match characters not listed, the complement operator ^ is used. When ^ is the first character in a character class, it indicates that only characters not listed in the character class should match; elsewhere it simply matches literally. E.g. [^^] will match anything but a ^

• The quoting character \ will allow the operators to match literally. We use \* to match a *.

• The wild card . will match any character, including a newline.

• For the non-capturing parentheses we have a choice of notation. Here we list some of the options, where r is some regular expression:

– The industry standard, to which Perl, Python, RE2 and most others adhere, is (?:r).

– Perl 6 [15] suggests the use of square brackets instead: [r]. These are however already in use by the character classes.

– A more intuitive notation could be using single parentheses for non-capturing, (r), and double parentheses for capturing, ((r)).

– A currently unused option is {r} as special notation, which would be simple to implement. This is however the industry standard for repetition notation.

Since there is a standard, we will adhere to it and use (?:r) for non-capturing parentheses.
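The rewritings mentioned above (r+ as rr∗, r? as ε|r and character classes as alternations) can be sketched as string transformations. This Python fragment is an illustration under stated assumptions, not the approach of the prototype: the helper names are hypothetical, the alphabet for complemented classes is assumed to be 7-bit ASCII unless given, and Python's `re` module is used only to check that the rewritten expressions behave as expected.

```python
import re

def expand_class(body, alphabet=None):
    """List the characters matched by a class, given the text between [ and ]."""
    if alphabet is None:
        alphabet = [chr(c) for c in range(128)]      # assumed alphabet
    negate = body.startswith("^")                    # ^ is special only when first
    if negate:
        body = body[1:]
    chars = set()
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":     # a range such as a-z
            chars.update(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:                                            # a single listed character
            chars.add(body[i])
            i += 1
    if negate:
        chars = set(alphabet) - chars
    return sorted(chars)

def class_to_alternation(body, alphabet=None):
    """Rewrite a character class as a non-capturing alternation."""
    options = (re.escape(c) for c in expand_class(body, alphabet))
    return "(?:" + "|".join(options) + ")"

def plus(r):
    """r+ written as rr*."""
    return "(?:%s)(?:%s)*" % (r, r)

def optional(r):
    """r? written as the alternation of the empty string and r."""
    return "(?:|%s)" % r

digit = class_to_alternation("0-9")
print(digit)            # (?:0|1|2|3|4|5|6|7|8|9)
print(plus(digit))
```

Complemented classes are the reason the alphabet has to be known: [^^] can only be written as an alternation by listing every other symbol of Σ.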

Example 2 (Extensions of regular expressions). As we saw in example 1, we can match a natural number with the regular expression (0|1|2|3|4|5|6|7|8|9)*. Using the extensions to regular expressions above, we can rewrite this as:

[0-9]* This means exactly the same thing.

[0-9]+ We can use a different repetition operator and require that there be at least one digit.

[1-9][0-9]* This matches any natural number as well, but it will not match any leading zeros. This is a refinement, in that it will match fewer text strings than the first expression. It is up to the expression writer to decide what the desired outcome is.

∗

2.3 Finite automatons

Finite automatons are used to solve a wide array of problems. In this thesis we will focus on finite automatons as they are used with regular expressions. A finite automaton consists of a number of states and transitions between states. It is constructed and used as follows:

• One state is marked as the initial state

• A set of states, zero or more, is marked as final

• A condition is attached to each transition between states

• Input is consumed in sequence, and for each symbol transitions are taken when their attached conditions are met

• If the simulation ends in a final state, the finite automaton is said to accept the input

Finite automatons can be divided into two main categories: the deterministic (DFA) and the non-deterministic (NFA) finite automaton. This distinction is mostly relevant in practice, as they are equivalent in terms of computing power. NFAs and DFAs recognize exactly the regular languages.

NFA For each pair of input symbol and state, there may be more than one next state. This means that there may be several paths through an NFA for a given input string.

The ε-transitions are an extension of the NFA. These are special transitions that can be taken without consuming any input symbols. This also has mainly practical implications: NFAs with and without ε-transitions are equivalent in computing power.

DFA For each pair of input symbol and state, there may be only one next state. This means there is only one path through the DFA for a given input string.

Figure 1: Fragment accepting a single character a

Figure 2: Fragment accepting the empty string

The advantage of a NFA is its size: the number of states and transitions in a NFA is linear in the size of the regular expression, whereas a DFA will in the worst case have a number of states exponential in the size of the regular expression. The advantage of a DFA is the effort required to simulate it. It only requires time linear in the size of the input string and constant space (not counting the space for the DFA), whereas the NFA requires time linear in the product of the size of the regular expression and the input string, and space linear in the size of the regular expression.
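To illustrate why simulating a DFA is cheap, here is a minimal Python sketch with a hand-built DFA for a*b over the alphabet {a, b}. The transition table and state numbering are assumptions made for this example. The only state kept between input symbols is a single integer, and each symbol costs one table lookup.

```python
# Hand-built DFA for a*b.
# State 0: start, still reading a's.  State 1: accepting.  State 2: dead.
TRANSITIONS = {
    (0, "a"): 0, (0, "b"): 1,
    (1, "a"): 2, (1, "b"): 2,
    (2, "a"): 2, (2, "b"): 2,
}
START, ACCEPTING, DEAD = 0, {1}, 2

def dfa_accepts(symbols):
    """Run the DFA over any iterable of symbols, one symbol at a time."""
    state = START                       # the only per-run memory
    for symbol in symbols:
        # Symbols outside the alphabet lead to the dead state.
        state = TRANSITIONS.get((state, symbol), DEAD)
    return state in ACCEPTING

print(dfa_accepts("aab"), dfa_accepts("aba"))   # True False
```

Because the function accepts any iterable, the input does not have to be held in memory as a whole. An NFA simulation has the same outer loop but must keep a set of states instead of a single one.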

2.4 Regular expression to NFA

Every regular expression can be converted to a NFA matching the same language. This section will describe an approach to doing so.

2.4.1 Thompson

The method described in this section first appeared in Ken Thompson's article from 1968 [13]. The descriptions given in for example [9], [1] and [4] are considered more readable and we will be basing our description on these.

The NFA will be built in steps from smaller NFA fragments. A NFA fragment has an initial state but no accepting state; instead it has one or more dangling edges leading nowhere (yet).

The base fragment corresponds to the regular expression consisting only of a single character a. The NFA fragment is shown in figure 1. One state with a single edge, marked with the character a, is added. The new state is the initial state for this fragment and the edge is left dangling.

The second base fragment corresponds to the empty regular expression. The NFA fragment is shown in figure 2. One state with a single edge marked as an ε-edge is added. The new state is the initial state for this fragment and the edge is left dangling. This fragment is used for the empty regular expression and for alternations with one or more options left empty.

The first compound fragment is alternation, see figure 3. Here, the two sub-fragments R and S are automatons with initial states and some dangling edges. What else they are composed of is irrelevant for the moment. We add one new state, and make it the initial state for this fragment. The initial state has two ε-edges leaving, connecting to the initial states of R and S. The dangling edges of the new fragment are the union of the dangling edges leaving R and S.

Figure 3: Alternation R|S

Figure 4: Concatenation RS

Concatenation of two regular expressions R and S is achieved as shown in figure 4. The dangling edges of R are connected to the initial state of S. The initial state for the new fragment is the initial state of R, and the dangling edges of S are still left dangling.

Zero or more times repetition is shown in figure 5. One new initial state is added. It has two ε-edges leaving: one is connected to the initial state of R and one is left dangling. The dangling edges of R are connected to the new initial state.

To finalize the construction of a NFA, we patch in an accepting state. All dangling edges in the final fragment are connected to the accepting state.

Figure 5: Repetition R*

Properties. NFAs created with Thompson's method have these properties:

• At most two edges leave any state

• There are no edges leaving the accepting state

• There are no edges leading into the starting state

These properties, specifically the first one, are the reasons why we choose to use NFAs and this specific method of generating them.
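The fragment rules above can be sketched in a few lines of Python. This is an illustration only, not the C implementation of the prototype; the data layout (a list of outgoing edges per state, with `None` as the target of a dangling edge), the tuple-shaped syntax tree and all names are assumptions of this sketch.

```python
EPS = ""   # label used for ε-edges; literal labels are single characters

class NFA:
    def __init__(self):
        self.edges = []          # edges[s] = list of [label, target] pairs

    def new_state(self, *labels):
        """Add a state with one dangling edge per label; return its index."""
        self.edges.append([[label, None] for label in labels])
        return len(self.edges) - 1

def patch(nfa, dangling, target):
    """Connect every dangling edge, given as (state, edge index), to target."""
    for state, i in dangling:
        nfa.edges[state][i][1] = target

def build(nfa, tree):
    """Return (initial state, dangling edges) of the fragment for tree."""
    kind = tree[0]
    if kind == "lit":                        # single character, figure 1
        s = nfa.new_state(tree[1])
        return s, [(s, 0)]
    if kind == "eps":                        # empty string, figure 2
        s = nfa.new_state(EPS)
        return s, [(s, 0)]
    if kind == "alt":                        # alternation, figure 3
        r_start, r_out = build(nfa, tree[1])
        s_start, s_out = build(nfa, tree[2])
        s = nfa.new_state(EPS, EPS)
        patch(nfa, [(s, 0)], r_start)
        patch(nfa, [(s, 1)], s_start)
        return s, r_out + s_out
    if kind == "cat":                        # concatenation, figure 4
        r_start, r_out = build(nfa, tree[1])
        s_start, s_out = build(nfa, tree[2])
        patch(nfa, r_out, s_start)
        return r_start, s_out
    if kind == "star":                       # repetition, figure 5
        r_start, r_out = build(nfa, tree[1])
        s = nfa.new_state(EPS, EPS)
        patch(nfa, [(s, 0)], r_start)        # one ε-edge enters R
        patch(nfa, r_out, s)                 # R loops back to the new state
        return s, [(s, 1)]                   # the other ε-edge is left dangling
    raise ValueError("unknown node: %r" % (kind,))

def thompson(tree):
    """Build the complete NFA; return (nfa, start state, accepting state)."""
    nfa = NFA()
    start, dangling = build(nfa, tree)
    accept = nfa.new_state()                 # accepting state, no outgoing edges
    patch(nfa, dangling, accept)
    return nfa, start, accept

# Syntax tree for a*b|c
tree = ("alt", ("cat", ("star", ("lit", "a")), ("lit", "b")), ("lit", "c"))
nfa, start, accept = thompson(tree)
for state, out in enumerate(nfa.edges):
    print(state, out)
```

In this sketch every rule adds at most one state with at most two edges, so the first two properties listed above can be checked directly on the result, and the number of states grows linearly with the size of the syntax tree.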

Example 3 (Converting a regular expression to a NFA). In this example we will be converting the regular expression a*b|c to a NFA using Thompson's method.

• At the top level we have the alternation operator, but before we can complete this fragment, we need to convert a*b and c to fragments.

  – a*b is more complicated, since we have one operator, two literals and a hidden concatenation. At the top level we have the concatenation operator, concatenating a* and b. These need to be converted before we can concatenate.

    ∗ a* needs to be broken down further. At the top level we have the Kleene star, but we cannot apply the rule for converting this to a NFA fragment before we have converted a.

      · a is straightforward: we just apply the rule for transforming literals and we have the fragment in figure 6(a).

      Using this fragment to complete the Kleene star, we have the fragment in figure 6(d).

    ∗ b is straightforward: we just apply the rule for transforming literals and we have the fragment in figure 6(b).

    Now we are ready to concatenate: fragments 6(d) and 6(b) are concatenated and we have the fragment in figure 6(e).

  – c is straightforward: we just apply the rule for transforming literals and we have the fragment in figure 6(c).

  With these expressions converted to fragments we can apply the alternation conversion rule. We have the resulting fragment in figure 6(f).

All that is left now is to connect the dangling edges to an accepting state. We have the final result in figure 6(g).

∗

Figure 6: Individual fragments when converting a*b|c to NFA. (a) Fragment for a. (b) Fragment for b. (c) Fragment for c. (d) Fragment for a*. (e) Fragment for a*b. (f) Fragment for a*b|c. (g) Final NFA for a*b|c.

Figure 7: NFA for the regular expression a*b|aab

2.5 Matching

The NFAs constructed as described in section 2.4 can be used to match a regular expression with a string, i.e. to determine if a string belongs to the language of the regular expression.

Once the NFA is generated, simulating it is a straightforward task. Again, our method is due to Thompson [13].

1. We maintain a set of active states and a pointer to the current character in the string

2. At the beginning only the start state belongs to the set of active states

3. The string is read from left to right, taking each character in turn. When a character is read from the input string, all legal transitions from the states in the active set are followed

4. A transition is legal if it is an ε-transition or if the mark on the transition matches the character read from the input string. The new set of active states is the set of end states for the transitions followed

5. If the accepting state is included in the active set when the whole string has been read, the string matches the regular expression

With this method we only ever add a state to the active set once per iteration and we only read each character from the input string once.

Example 4 (Matching with an NFA). In this example we will demonstrate how the regular expression a*b|aab is matched with the string aab. In figure 7 we have the corresponding NFA. Each state is marked with a unique number which we will be referring to in the table below.



Active set   SP    Explanation
1            aab   Initially we have the start state in the active set and SP points to the start of the string.
3, 4, 5      aab   Following all ε-transitions.
2, 6         aab   Reading the first a from the input string, states 3 and 5 have legal transitions on a.
3, 4, 6      aab   Following all ε-transitions.
2, 7         aab   Reading the second a from the input string, states 3 and 6 have legal transitions on a.
3, 4, 7      aab   Following all ε-transitions.
8            aab   Reading the last character from the input string: b, states 4 and 7 have legal transitions on b.
8            aab   No ε-transitions to follow.

After reading the string we can see that the accepting state is in the active set: We have a match!

∗
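The simulation in example 4 can be replayed with a small table-driven sketch. The Python below is illustrative only: the transition tables EPS and CHAR are an assumption, read off the explanations in the table above (state 1 and state 2 are taken to be the split states), and the names closure and simulate are hypothetical. As in the table, split states are left out of the reported active sets.

```python
# Assumed encoding of the NFA in figure 7, reconstructed from example 4.
EPS = {1: [2, 5], 2: [3, 4]}                      # ε-transitions
CHAR = {3: ("a", 2), 4: ("b", 8), 5: ("a", 6),    # state -> (character, target)
        6: ("a", 7), 7: ("b", 8)}
ACCEPT = 8
SPLIT_STATES = set(EPS)

def closure(states):
    """Follow all ε-transitions; every state is added at most once."""
    seen, stack = set(), list(states)
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(EPS.get(s, []))
    return seen

def simulate(text, start=1):
    """Return (matched, trace), where trace lists the active set after each ε-step."""
    active = closure({start})
    trace = [sorted(active - SPLIT_STATES)]
    for ch in text:
        # Follow every legal transition on the character just read.
        moved = {CHAR[s][1] for s in active if s in CHAR and CHAR[s][0] == ch}
        active = closure(moved)
        trace.append(sorted(active - SPLIT_STATES))
    return ACCEPT in active, trace
```

Calling simulate("aab") reports a match, and the trace lists the active sets after each round of ε-transitions, which can be compared row by row with the table.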

2.6 Summary

Regular expressions are a widely used and popular tool. The features offered and the semantics vary. For example, some engines offer back referencing and others do not; some offer a leftmost match in alternations, others a longest match. Even for engines with similar feature sets, the underlying implementation and performance can vary widely. A regular expression engine can typically solve some types of problems more efficiently than others, or vice versa: it may be particularly bad at a given problem.

There are many highly specialized regular expression engines exemplifying this. To briefly mention an example: structured text like SGML documents benefits from a different approach than most industry standard engines use. Many times you will need to find the text between two tags, but many tools are not geared for this kind of search: the search can span several lines and we will usually want a shortest match. See [12] for details.

To our knowledge, no others are pursuing a regular expression engine built on the design described in this thesis. The division of the workload into several different components using parse trees to communicate progress is unique. What we hope to gain from this approach is flexibility and a guaranteed upper bound on the memory consumed for a match.


3 Designing a memory efficient regular expression engine


In the following we will describe our design and the alternatives we considered. First we will have some general reflections on the overall design, which we will build on in a discussion about alternatives and solutions. At the end of this section we will have described our chosen solution and the individual components.

3.1 Architecture

In this thesis we will build a general framework for matching regular expressions with strings. Our vision is a flexible architecture where the user is in control. Regular expression matching is a sequence of operations, where not all operations are needed at all times. This leads to the idea that we can split the regular expression engine into several dedicated parts. This can be demonstrated by considering the tasks of simple acceptance and extraction of groupings: the first only reports if a string matches a regular expression and the latter will also report on any groupings. By pulling the extraction of groupings out of the regular expression engine, we make the job of reporting simple acceptance simpler.

Before moving on, there are some prerequisites that must be discussed. This leads us to a discussion on possible mechanisms that would allow us to separate each task. We require several things: a mechanism to construct an NFA and a compact means of passing on the current state of the match process for each task.

3.1.1 Constructing an NFA

In this thesis we have chosen to use Thompson's method of constructing NFAs. The NFAs constructed in this manner exhibit desirable properties: no state has more than two outgoing transitions and the number of states grows linearly in the size of the regular expression. Typically you would take this one step further and in some way build a DFA from the NFA, since these have much better traversal properties. We will not be doing this; the worst-case behavior of building a DFA is exponential both in time and space, as we will see in the evaluation section, and a DFA will generally not have at most two outgoing edges per state. Particularly the last part about the outgoing edges makes us choose the NFA over the DFA, as the worst-case behavior will in practice very rarely happen.



3.1.2 Dube and Feeley

One way of communicating the current state of the match process would be to send the whole parse tree. An efficient algorithm for parsing with regular expressions is presented by Dube and Feeley in their paper from 2000 [6]. The algorithm produces a parse tree describing how string w matches regular expression r. For a fixed regular expression the algorithm runs in time linear in the size of w.

To build the parse tree, we first construct an NFA corresponding to r. The article specifies a method for construction, but this can be any NFA constructed so that the number of states is linear in the length of r; this includes those constructed with Thompson's method [13]. This restriction ensures the run time complexity. Until this point there is no difference from a standard NFA, but Dube and Feeley then add strings to some of the edges. These strings are output whenever the associated edge is followed. When the output strings are then read in order they form a parse tree.

The idea of having output attached to edges is further developed in the paper [8]. The parse trees Dube and Feeley's method yields are rather verbose and can be more compactly represented: whenever a node has more than one outgoing edge, a string is added to the edge, containing just enough information to decide which edge was taken.

NFA simulation with Dube and Feeley's algorithm takes up space linear in the regular expression. We need to allocate space for the NFA and for a list of active states; both use space linear in the regular expression. Added to this are the storage requirements for the output, which will take up space linear in the product of the size of the regular expression and the input string: for each input character we can at most take every transition once. This is the same asymptotic behavior as the compacted version from Henglein and Nielsen's paper.

So the total memory cost, counting both the simulation phase and saving the output, is linear in the product of the size of the regular expression and the input string.

3.1.3 Bit-values and mixed bit-values

Henglein and Nielsen introduce the notion of bit-values in [8]. A bit-value is a compact representation of how a string matches a regular expression. In itself it is just a sequence of 0s and 1s and has no meaning without the associated regular expression. The actual bit-value for a string is not unique and will depend on the choice of regular expression. If the regular expression is ambiguous and matches the string in more than one way, there will also be more than one sequence, or bit-value, for this combination of string and regular expression.

When relying on the property of a Thompson NFA that no state has more than two outgoing transitions, we have a perfect mapping for the bit-values. Instead of mapping syntax tree constructors to bit-values, we will map the outgoing transitions in split-states to bit-values. Each time we are faced with a choice when traversing the NFA, we will record that choice with a bit-value. This will enable us to recreate the exact path through the NFA. See also [6].

For reasons we will discuss in more detail later, we introduce the notion of mixed bit-values. When simulating the NFA we will simultaneously be creating many bit-values which may or may not end up in an actual match. Each of these individual bit-values will be referred to as a channel. Mixed bit-values are the set of all these channels and they are simply a way of talking about multiple paths through the NFA.
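To illustrate how a bit-value identifies a path, the sketch below replays a bit-value against the automaton for a* shown later in figure 9. It is a minimal sketch under stated assumptions: the tables SPLIT and CHAR encode that automaton as described in example 5 (node 1 is a split node, node 2 has an a-transition back to node 1, node 3 is accepting), and the name replay is hypothetical.

```python
# Assumed encoding of the automaton for a* (figure 9).
SPLIT = {1: (2, 3)}      # split state -> (target for bit 0, target for bit 1)
CHAR = {2: ("a", 1)}     # state -> (character, target)
ACCEPT = 3

def replay(bits, start=1):
    """Recreate the path and the matched string from a bit-value."""
    bits = iter(bits)
    state, path, text = start, [start], []
    while state != ACCEPT:
        if state in SPLIT:
            # A choice point: the next bit says which ε-edge was taken.
            state = SPLIT[state][int(next(bits))]
        else:
            # No choice here, so no bit is consumed.
            ch, state = CHAR[state]
            text.append(ch)
        path.append(state)
    return "".join(text), path
```

Here replay("0001") recreates the string aaa, and replay("1") recreates the empty string. The bit-value is meaningless without the automaton, exactly as noted above: the same bits replayed against a different automaton describe a different path.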

3.1.4 Splitting up the workload

We have now introduced the bit-values. The bit-values enable us to split up the work in several tasks.

• The first task will be to create the mixed bit-values describing the paths the Thompson matching algorithm takes through the NFA. The first task will need the regular expression to form the NFA and the string for the matching. Note that there is no need to store the whole string: the matching processes the characters in the input string in a streaming fashion.

• The next and also last step in a simple acceptance match would be to check the mixed bit-values for a match. Simply scan the bit-values for acceptance.

• In extracting the values of groupings, we would need more tasks. We could form a task that cuts away unneeded parts of the parse tree. Only the parts concerned with contents of the groupings would be needed to actually extract the values. To do this we would require the regular expression to form an NFA annotated so that we could recognize the relevant parts of the syntax tree.

• We have a stream of mixed bit-values. It would be necessary at some point to extract the channel that makes up the actual match, if there is one. This cannot be done in a streaming fashion. When first encountering a new channel, we need to know whether or not it has a match. The only way to know this is to read the whole stream of mixed bit-values. This task would only need the stream of mixed bit-values; it has no need for the regular expression.

• The last step in extracting the values of the groupings would be to output the actual values in some format. To do this we would require the bit-values from the match and the regular expression. The regular expression will have to be adjusted to fit the bit-values output by the groupings filter.

3.1.5 Solutions

There are two main methods of realizing this design. We can make the tasks small separate programs that communicate through pipes, or we can make one program where the tasks are processes that communicate through some inter-process communication model. The separate programs model has the advantage of being simpler, in that the communication framework is already in place: we would not have to worry about synchronization and such. The processes model would probably have the advantage of faster communication and a generally lower overhead, depending on which model for inter-process communication was chosen. We have chosen the separate programs model, because of the ease with which you can combine the separate programs and the much simpler communication model. This also opens up the possibility of storing the output from one task for later use, or perhaps even piping the output to a completely different system with, for example, netcat.

The tasks will in some sense be projections performed on the mixed bit-values and the bit-values. The programs will therefore be called filters.

We present the overall architecture in figure 8. The first program is the matcher: it takes the regular expression as an argument, has the string piped in and outputs the mixed bit-values that comprise the match. The second program takes the mixed bit-values from the first and filters out those mixed bit-values relevant to the capturing groupings only. The third program takes mixed bit-values and filters out the bit-values relevant to the actual match; if there is no match, the output will be the empty string. The fourth program takes bit-values and constructs the string that was matched with those bit-values. If you rewrite the regular expression so that it only consists of the capturing groups and adjust your bit-values accordingly, this will result in the capturing groups being output. We have put a fifth, hypothetical, program in the design to signal that you could have more filters and place them anywhere in the chain (though there are some common sense limitations).

3.2 Protocol specification

In this section we will define a protocol that can communicate information between our programs. The information consists of the mixed bit-values generated by the NFA simulator and the filters.



Figure 8: Architecture outline. Components: Actor, Match, Group, Trace, other filters, Serialize. Data flows: Reg-ex, Text stream, Mixed bit values, Bit values, Rewritten reg-ex, Text output.

• The protocol should enable us to recreate paths taken through an NFA.

• The protocol should be one-way. Information can only flow in one direction.

• This protocol is intended to communicate between programs where we can expect perfect synchronization and unambiguity. For example, it will not be necessary to include any error correction.

• The protocol will be text based, primarily to ease development and debugging. It is entirely feasible to later replace this with a binary protocol.

Our protocol is very compact. The actual implementation may use different symbols to represent the operators below. We need our protocol to support the following operators. A description is supplied for each.

| The end of the channel list is reached and we should set the active channel to the first channel. This coincides with reading a new character. It is not a strictly necessary operator; we can make do with the change channels action. We choose to keep a separate action for end of list, because it adds to readability and redundancy.

: Whenever we change channels we put a :. There may be more than one or perhaps even no bits output on a channel for any given character from the string.

= Copying of a channel. One channel is split into two; the paths taken through the NFA will be identical up to the point of splitting. The newly created channel is put in front of the rest of the channels.



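Figure 9: Automaton with bit-values for regular expression a* (node 1 is a split node with an ε-edge marked 0 to node 2 and an ε-edge marked 1 to node 3; node 2 has a transition marked a back to node 1; node 3 is the accepting node)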

0,1 The actual bit values.

\a Character classes are a special case. To later be able to recreate the exact string that we matched, we will need to know which character a character class matched. To meet this requirement we will output the character that the character class matched. To signal that such a character is coming we use an escape \.

b A channel is abandoned with no match

t A channel has a match

Example 5 (Protocol). In figure 9 we have an automaton for regular expression a*. When matching this regular expression with the string aa we generate some mixed bit-values. This example will demonstrate in detail how the mixed bit-values are generated.

Initial step: Initially the start state of the automaton is added to the active list. All ε-edges are followed and the following is output:

1. Node 1 is a split-node, a = is output, and we follow the ε-edge to node 2 and output a 0. We cannot make any further progress on this channel. We output a : and switch to the next channel.
   Output so far: =0:
   List of active channels: {2, 1}

2. The active channel is now in node 1, we follow the ε-edge from node 1 to 3 and output a 1. We cannot make any further progress on this channel. This is the last channel in the channel list, so we output a | and reset the active channel.
   Output so far: =0:1|
   List of active channels: {2, 3}

First a is read:

1. Node 2 has a transition marked a, we follow this back to node 1. Node 1 is a split-node, a = is output, and we follow the ε-edge to node 2 and output a 0. We cannot make any further progress on this channel. We output a : and switch to the next channel.
   Output so far: =0:1|=0:
   List of active channels: {2, 1, 3}

2. The active channel is now in node 1, we follow the ε-edge from node 1 to 3 and output a 1. We cannot make any further progress on this channel. We output a : and switch to the next channel.
   Output so far: =0:1|=0:1:
   List of active channels: {2, 3, 3}

3. Node 3 is the accepting node and does not have any transitions. We abandon this channel and output a b. This is the last channel in the channel list, so we output a | and reset the active channel.
   Output so far: =0:1|=0:1:b|
   List of active channels: {2, 3}

Second a is read: This is the final step.

1. From node 2 we can make a transition on a back to node 1. This is a split node, so we output a = and transition on the ε-edge to node 2 and output a 0. We cannot do further transitions and this is not the accepting node, so we abandon this channel and output a b. We switch to the next channel and output a :.
   Output so far: =0:1|=0:1:b|=0b:
   List of active channels: {1, 3}

2. Node 1 has an ε-transition to node 3, we take it and output a 1. We cannot do further transitions and since this is the accepting node, we output a t. We have one channel left, so we output a : and switch.
   Output so far: =0:1|=0:1:b|=0b:1t:
   List of active channels: {3}

3. Node 3 has no available transitions. We abandon this channel and output a b.
   Output so far: =0:1|=0:1:b|=0b:1t:b
   List of active channels: {}

∗
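The steps of example 5 can be condensed into a short generator. The Python below is a sketch under stated assumptions, not the thesis's matcher: it hard-codes the automaton for a* from figure 9 in the hypothetical tables SPLIT and CHAR, it does not remove duplicate states, and it would loop forever on automata with ε-cycles. The function name mixed_bit_values is also hypothetical.

```python
# Assumed encoding of the automaton for a* (figure 9).
SPLIT = {1: (2, 3)}      # split state -> (target of edge 0, target of edge 1)
CHAR = {2: ("a", 1)}     # state -> (character, target)
ACCEPT = 3

def mixed_bit_values(text, start=1):
    out, channels = [], [start]
    steps = [None] + list(text)          # None marks the initial ε-only step
    for n, ch in enumerate(steps):
        last = n == len(steps) - 1
        # Each work item is (state, is_copy); copies still owe their 1-edge.
        work = [(s, False) for s in channels]
        channels, i = [], 0
        while i < len(work):
            state, is_copy = work[i]
            alive = True
            if is_copy:
                out.append("1")
                state = SPLIT[state][1]
            elif ch is not None:
                if state in CHAR and CHAR[state][0] == ch:
                    state = CHAR[state][1]
                else:
                    out.append("b")      # no legal transition: abandon channel
                    alive = False
            while alive and state in SPLIT:
                out.append("=")          # copy the channel at the split state
                work.insert(i + 1, (state, True))
                out.append("0")
                state = SPLIT[state][0]
            if alive and last:
                out.append("t" if state == ACCEPT else "b")
            elif alive:
                channels.append(state)
            i += 1
            if i < len(work):
                out.append(":")          # switch to the next channel
            elif not last:
                out.append("|")          # end of channel list
    return "".join(out)
```

For the input aa this sketch produces =0:1|=0:1:b|=0b:1t:b, the same sequence as the example. The copy made at a split state is inserted directly after the active channel, which is one reading of "put in front of the rest of the channels".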

3.3 Filters

We have now established that filters are stand-alone programs that take input, perform some projection and output the result. This leads us to a more detailed description of the filters developed for this thesis.



The filters can be combined to make a whole, though naturally some combinations make more sense than others. For example, it generally makes sense to put filters that remove unnecessary information early in the stream, to reduce the input data sizes for downstream filters.

3.3.1 The ’match’ filter

Input Any mixed bit-values or bit-values

Output A single value indicating match or no match.

This is a simple filter. The input is scanned for a t control character; if present we output a t, otherwise we output a b. In the case of empty input, we will output an error message, because the empty input is most likely due to an error in the previous programs. To save time on processing, we will assume the input format is correct.

Example 6 (The ’match’ filter). The regular expression a* matches the string aaa:

$ echo -n 'aaa' | ./main 'a*' | ./ismatch
t

a* does not match bbb:

$ echo -n 'bbb' | ./main 'a*' | ./ismatch
b

Since we do not check the correctness of the input, the sentence “the cake is a lie”, which is clearly not in the correct input format with regards to the protocol defined in section 3.2 on page 17, will also produce a positive answer from the filter:

$ echo -n 'the cake is a lie' | ./ismatch
t

∗
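The behavior of the ’match’ filter can be sketched in a few lines. This Python is an illustration only and not the ismatch program itself: the function name ismatch and the choice to read the input as an iterable of text chunks are assumptions. Reading chunk by chunk keeps the memory use independent of the input length.

```python
def ismatch(chunks):
    """Scan a stream of text chunks for the control character t."""
    seen_input = False
    for chunk in chunks:
        seen_input = seen_input or bool(chunk)
        if "t" in chunk:
            return "t"       # a channel reported a match
    if not seen_input:
        # Empty input most likely means an earlier program failed.
        raise ValueError("empty input")
    return "b"
```

Like the filter described above, the sketch does not validate its input, so ismatch(["the cake is a lie"]) also answers t.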

3.3.2 The ’trace’ filter

Input Mixed bit-values

Output Bit-values

The mixed bit-values are a way of keeping track of multiple paths through the NFA. This filter will remove all channels from the mixed bit-values, except the one that has a match. We are using Thompson's method for matching, so we can be sure there is at most one channel with a match.



This will be a non-streaming filter. This problem cannot be solved without in some way storing the mixed bit-values: We need knowledge of whether or not a channel has a match at the beginning, but we will not have that knowledge until the end.

Example 7 (The ’trace’ filter). In the previous example we saw that the regular expression a* matches the string aaa. The NFA for the regular expression is in figure 9 on page 19, marked with state numbers and bit-values. This particular match will generate the following mixed bit-values: =0:1|=0:1:b|=0:1:b|=0b:1t:b. The filter should then only return the bit-values 0001, which represent the match. The filter should return the empty string if there is no match. ∗
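The job of the ’trace’ filter can be sketched as an interpreter for the protocol operators of section 3.2. The Python below is a minimal sketch, not the thesis implementation. It assumes that a copied channel is inserted directly after the active channel, that abandoned channels stay in the list until the next |, and that an escaped character is kept verbatim on its channel; the function name trace is hypothetical.

```python
def trace(stream):
    """Return the bit-values of the matching channel, or '' if there is none."""
    channels = [{"out": [], "alive": True}]
    active, match, i = 0, None, 0
    while i < len(stream):
        sym = stream[i]
        if sym == ":":
            active += 1                          # switch to the next channel
        elif sym == "|":
            # End of channel list: drop abandoned channels, start over.
            channels = [c for c in channels if c["alive"]]
            active = 0
        else:
            cur = channels[active]
            if sym == "=":
                copy = {"out": list(cur["out"]), "alive": True}
                channels.insert(active + 1, copy)
            elif sym in "01":
                cur["out"].append(sym)
            elif sym == "\\":
                i += 1                           # escaped character follows
                cur["out"].append("\\" + stream[i])
            elif sym == "b":
                cur["alive"] = False
            elif sym == "t":
                match = "".join(cur["out"])
        i += 1
    return match if match is not None else ""
```

The sketch stores every live channel until the end of the stream, which illustrates why this filter cannot be streaming: the matching channel is only known when its t arrives.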

3.3.3 The ’groupings’ filter

Input Mixed bit-values

Output Mixed bit-values for rewritten regular expression

This filter facilitates reporting the content of captured groups. The filter outputs the mixed bit-values associated with the groupings. By this we mean that all mixed bit-values generated while inside a captured group should be sent to output and all mixed bit-values generated outside a group should be thrown away. By throwing away the unnecessary bit-values we hope to make the mixed bit-values sequence shorter. This will be an advantage when the time comes to apply the trace filter, which is non-streaming, described in section 3.3.2.

Example 8 (Simple groupings filter). Here we have a few simple examples of what the groupings filter should do.

• For regular expression (a|b) matched with a the mixed bit-values are =0:1|t:b. Since the whole regular expression is contained in a capturing parenthesis, nothing should be thrown away. Output should contain =0:1|t:b.

• For regular expression (?:a|b)(c|d) matched with ac the mixed bit-values are =0:1|=0:1:b|t:b. This time the first part of the regular expression is contained only in a non-capturing parenthesis and the associated bit-values should be thrown away. We want to keep only the bit-values from the second alternation. Output should contain =:|=0:1:b|t:b.

In this example we have only dealt with simple cases. Regular expressions containing parentheses under alternation and repetition, e.g. (a)|b and (a)*, require extra care and will be discussed later. ∗



The output of the groupings filter can be used to navigate the NFA for the regular expression altered in a similar manner: Everything not in a capturing parenthesis is thrown away. From example 8 we have the regular expression (?:a|b)(c|d); if we throw away everything not in a capturing parenthesis we have left (c|d). Stated in a more formal manner, we can define our first naive rewriting function G′:

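\begin{align*}
G'[\![\varepsilon]\!] &= \varepsilon \\
G'[\![a]\!] &= \varepsilon \\
G'[\![[\ldots]]\!] &= \varepsilon \\
G'[\![r_1 r_2]\!] &= G'[\![r_1]\!]\,G'[\![r_2]\!] \\
G'[\![r_1|r_2]\!] &= G'[\![r_1]\!]\,G'[\![r_2]\!] \\
G'[\![r^*]\!] &= G'[\![r]\!] \\
G'[\![r^+]\!] &= G'[\![r]\!] \\
G'[\![r?]\!] &= G'[\![r]\!] \\
G'[\![(?:r)]\!] &= G'[\![r]\!] \\
G'[\![(r)]\!] &= (r) \tag{1}
\end{align*}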

Capturing under alternation As is seen, G′ basically throws away anything not in a capturing parenthesis. There are however a few problems with this definition, as hinted earlier. Our first problem is regular expressions with a capturing parenthesis under alternation. When the capturing parenthesis is under the alternation and we throw away the alternation, we lose a vital choice: There is no longer a way to signal whether or not a group participates in a match.

Example 9 (Capturing under alternation). In matching the regular expression (a)|(b), see figure 10(a) for the NFA, with the string a we obtain these mixed bit-values:

=0:1|t:b

What these mixed bit-values are saying is that we have 2 channels, one that goes through a and succeeds and one that goes through b and fails. The succeeding channel never goes through b, so the contents of that group are not defined.

Rewriting the regular expression (a)|(b) according to G′ we have:

\begin{align*}
G'[\![(a)|(b)]\!] &= G'[\![(a)]\!]\,G'[\![(b)]\!] \\
&= (a)(b)
\end{align*}



Figure 10: Capturing under alternation. Panels: (a) (a)|(b); (b) (a)(b); (c) (?:|(a))(?:|(b)).

In this regular expression there is only one way: The one going through both the groups. See figure 10(b) for the NFA of the expression. This is bad news for our rewriting function and our filter, since we need some way of skipping groups: Each channel goes through only one group. ∗

In example 9 we saw an example of how undefined groups are not handled. To solve this problem we need some way of signaling if a group participates in a match or not. We define a new rewriting function G′′; it is identical to G′ except for equation 1, which is changed to:

\[
G''[\![(r)]\!] = (?:|(r))
\]

This change will enable us to choose which groups participate in a match. This comes at a cost: Extra bits will have to be added to the mixed bit-values output and extra alternations to the rewritten regular expression.

Example 9 (continuing from p. 23). With the changed equation 1 we can continue our example from before. Again we rewrite regular expression (a)|(b), this time according to G′′:

\begin{align*}
G''[\![(a)|(b)]\!] &= G''[\![(a)]\!]\,G''[\![(b)]\!] \\
&= (?:|(a))(?:|(b))
\end{align*}

See figure 10(c) for the NFA. As is clear from the rewritten regular expression and the NFA, there is now a way around the groups. Taking this into account, the output for the groupings filter should be:



=1:01|0t:b

What these mixed bit-values are saying is that we have two channels: one picks the route through a, around b and succeeds, and the other picks the route around a, through b and fails. ∗

As needed, we now have a way of signaling if a particular group is in a match: Insert a 1 in the mixed bit-values and the group participates, or insert a 0 and it does not.

Capturing under repetition The other problem we hinted at has to do with capturing under repetition. When using a capturing subpattern, it can match repeatedly using a quantifier. For example, when matching (.)* with the string abc, the first time we apply the * we capture an a, the second time a b and the last time a c. In such a case we have several options when reporting the strings that were captured:

• The first

• The last, this is what most backtracking engines like Perl do

• All, this is what a full regular expression engine does
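The "last" option can be observed in Python's standard re module, which is also a backtracking engine. The snippet is only an illustration of that reporting choice and is unrelated to the thesis prototype.

```python
import re

# (.)* is applied three times on "abc"; the group is overwritten each time.
m = re.match(r"(.)*", "abc")
whole = m.group(0)   # the complete match
last = m.group(1)    # only the last iteration of the group is reported
```

After the match, last holds c: the captures a and b from the earlier iterations are no longer available.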

Only two of these options are available to a streaming filter: All and the first. In order to return the last match, we would have to save the latest match when matching with the quantifier: it is potentially the last and we cannot know until we are done matching with the quantifier.

Returning the first string that was captured by the quantifier forces us to throw away mixed bit-values generated in a capturing parenthesis. We would only need the mixed bit-values generated by the first iteration of the quantifier.

To return all the strings captured by a group, we simply output all the mixed bit-values generated while in the capturing parenthesis. However, this causes problems with the rewriting function. Rewriting (.)* according to G′′ we have (?:|(.)). This regular expression accepts at most one single character. In no way can we make mixed bit-values, fitting this regular expression, that represent a list of matched strings. Therefore we add the following equations:

\begin{align*}
G''[\![(r)^*]\!] &= (r)^* \\
G''[\![(r)^+]\!] &= (r)^+
\end{align*}

We should now also keep the mixed bit-values that glue the iterations together, even though they are outside the capturing group.

We are now ready to present the final rewriting function: Definition 3.



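Definition 3 (The groupings filter rewriting function). For regular expressions r, r1, r2, defined over alphabet Σ, and a, any character from Σ, let G be defined by:

\begin{align*}
G[\![\varepsilon]\!] &= \varepsilon \\
G[\![a]\!] &= \varepsilon \\
G[\![[\ldots]]\!] &= \varepsilon \\
G[\![r_1 r_2]\!] &= G[\![r_1]\!]\,G[\![r_2]\!] \\
G[\![r_1|r_2]\!] &= G[\![r_1]\!]\,G[\![r_2]\!] \\
G[\![r^*]\!] &= G[\![r]\!] \\
G[\![r^+]\!] &= G[\![r]\!] \\
G[\![r?]\!] &= G[\![r]\!] \\
G[\![(?:r)]\!] &= G[\![r]\!] \\
G[\![(r)]\!] &= (?:|(r)) \\
G[\![(r)^*]\!] &= (?:|(r)^*) \\
G[\![(r)^+]\!] &= (?:|(r)^+)
\end{align*}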

∗

Example 10 (The groupings filter rewriting function). Here follow a few examples of how the groupings filter rewriting function, G, works.

\begin{align*}
G[\![(a)|b]\!] &= G[\![(a)]\!]\,G[\![b]\!] \\
&= (?:|(a))\,\varepsilon \\
&= (?:|(a)) \\
G[\![((cup)cake)]\!] &= (?:|((cup)cake)) \\
G[\![(a|b)^*]\!] &= (?:|(a|b)^*)
\end{align*}

∗
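Definition 3 translates directly into a recursive function over a syntax tree. The Python below is a sketch under stated assumptions: the tuple-based tree representation and the names QUANT, show, G and word are hypothetical, and the tree is assumed to contain an explicit group node wherever the regular expression has parentheses.

```python
QUANT = {"star": "*", "plus": "+", "opt": "?"}

def show(r):
    """Render a syntax tree back to regular expression text."""
    kind = r[0]
    if kind == "eps":
        return ""
    if kind in ("lit", "class"):
        return r[1]
    if kind == "cat":
        return show(r[1]) + show(r[2])
    if kind == "alt":
        return show(r[1]) + "|" + show(r[2])
    if kind in QUANT:
        return show(r[1]) + QUANT[kind]
    if kind == "ncg":
        return "(?:" + show(r[1]) + ")"
    return "(" + show(r[1]) + ")"          # "grp": capturing group

def G(r):
    """The rewriting function of Definition 3, returning regular expression text."""
    kind = r[0]
    if kind in ("eps", "lit", "class"):
        return ""                           # everything outside groups is dropped
    if kind in ("cat", "alt"):
        return G(r[1]) + G(r[2])
    if kind in ("star", "plus") and r[1][0] == "grp":
        return "(?:|" + show(r[1]) + QUANT[kind] + ")"
    if kind in QUANT or kind == "ncg":
        return G(r[1])
    return "(?:|" + show(r) + ")"           # capturing group, kept verbatim

def word(s):
    """Helper: a concatenation of literals."""
    r = ("lit", s[0])
    for ch in s[1:]:
        r = ("cat", r, ("lit", ch))
    return r

cupcake = ("grp", ("cat", ("grp", word("cup")), word("cake")))
```

With these definitions G(cupcake) gives (?:|((cup)cake)). Note that a capturing group is copied verbatim, so groups nested inside it are not rewritten again.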


4 Implementing a regular expression engine


The task of implementing a regular expression engine can be undertaken in steps. The first step is converting the regular expression to an NFA. The next step is to simulate the NFA. The last step in our implementation is to build filters. We have included the source code in section F on page 97.

4.1 Regular expression to NFA

The first step in our regular expression engine is the regular expression to NFA converter. As discussed in section 2.4.1 on page 8, the NFA is built from the regular expression in steps from smaller NFA fragments. In order for this method, used directly, to be successful, the regular expression has to be in a form where the meta characters and the literals are presented in the right order. Regular expressions with for example | cannot simply be read from left to right and be converted correctly. The problem with the alternation operator is that it is an infix operator, so we only have the left hand side and not the right hand side when we read the | and can therefore not complete the fragment.

Converting the regular expression to reverse polish notation, with an explicit concatenation operator, or making a parse tree will solve these problems. For this project neither is chosen. A third solution to this problem is maintaining a stack where fragments and operators are pushed and popped. This is the method that is implemented. We tried determining the quality of the decision by comparing run times with Russ Cox's example code [3]. This did not go well, for several reasons. The main reason is that the example code does not do well on large examples³ and large examples are needed to do a reasonable comparison.

³ There are constants in the source code and a naive list append function.

We followed Russ Cox's method from [4] when converting the regular expression to NFA. Russ Cox rewrites the regular expression to reverse polish notation with an explicit concatenation operator, so some changes will be necessary. There are three main areas that need to be changed:

Concatenation While constructing the NFA, NFA fragments are pushed onto a stack. Whenever the concatenation operator is encountered, the two top fragments are popped and patched together, see figure 4 on page 9. We do not have the advantage of an explicit concatenation operator. Instead we will be trying to pop the top two NFA fragments and patch them together as often as possible. As often as possible is after a character is read, but before any action is taken on the character read. The exception to this rule is the quantifiers, which bind tighter than concatenation.



Parentheses The binding of the operators can be changed with parentheses. Since we are not using a tree structure or reverse polish notation with an explicit concatenation operator, there is nothing showing the structure of how everything binds when simply reading the regular expression from left to right. We need some way of connecting the left parentheses to the matching right parentheses. For this we will be using the stack; we will expand it to also accept operators. Every time we read a left parenthesis in the regular expression, a left-parenthesis-fragment is pushed onto the stack. When we later on read a right parenthesis we simply pop fragments off the stack and patch them together till we reach a left-parenthesis-fragment.

Alternation When reading the regular expression left to right, we only have the left NFA fragment ready when reading the alternation operator. Therefore we simply push the alternation operator on the stack. Whenever possible we pop the alternation operator and associated NFA fragments and patch them together, see figure 3 on page 9. This is probably not very often, as it will only happen after reading a right parenthesis, an alternation operator or the end of the regular expression.

We have two important helper functions: maybe_concat and maybe_alternate. The first concatenates the top two fragments if possible, see also figure 4 on page 9. The second alternates the top fragments if possible, see also figure 3 on page 9. maybe_alternate will pop alternate markers from the stack. These are called as often as possible to keep the stack depth at a minimum and to avoid postponing all the concatenating and alternating till the end. Supplying a regular expression consisting entirely of left parentheses will still make the stack depth grow to a maximum.
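The interplay between the stack and the two helper functions can be sketched as follows. This Python is a simplified illustration and not the thesis source code: it builds syntax trees (tuples) instead of NFA fragments, it only supports literals, |, *, +, ? and capturing parentheses, and it does no error handling. Only the names maybe_concat and maybe_alternate are taken from the text; everything else is hypothetical.

```python
def parse(regex):
    stack = []   # holds fragments (tuples) and the markers "(" and "|"

    def frag_at(depth):
        # True if the item `depth` places from the top is a fragment.
        return len(stack) >= depth and isinstance(stack[-depth], tuple)

    def maybe_concat():
        if frag_at(1) and frag_at(2):
            right, left = stack.pop(), stack.pop()
            stack.append(("cat", left, right))

    def maybe_alternate():
        if frag_at(1) and len(stack) >= 3 and stack[-2] == "|" and frag_at(3):
            right = stack.pop()
            stack.pop()                      # the alternation marker
            left = stack.pop()
            stack.append(("alt", left, right))

    for c in regex:
        if c in "*+?":
            # Quantifiers bind tighter than concatenation: no maybe_concat first.
            kind = {"*": "star", "+": "plus", "?": "opt"}[c]
            stack.append((kind, stack.pop()))
            continue
        maybe_concat()                       # before acting on the character read
        if c == "(":
            stack.append("(")
        elif c == "|":
            maybe_alternate()
            stack.append("|")
        elif c == ")":
            maybe_alternate()
            inner = stack.pop()
            stack.pop()                      # the matching "(" marker
            stack.append(("grp", inner))
        else:
            stack.append(("lit", c))
    maybe_concat()
    maybe_alternate()
    return stack[0]
```

For a*b|c the sketch concatenates a* and b when the | is read, and performs the alternation at the end of the regular expression. For ab* the quantifier is applied to b alone, because maybe_concat is skipped when a quantifier is read.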

4.1.1 Character classes

Character classes are part of the extension we made to the regular expression definition. When implementing them, we have the choice of rewriting character classes in terms of the original regular expressions, but as we can see in figure 11, this quickly becomes unwieldy. When we rewrite, we add almost two states per character matched by the character class, instead of adding just one state for the whole character class. What we want is an NFA similar to figure 11(a), not figure 11(b).

Figure 11: A simple character class-transition example: (a) [a-c], (b) a|b|c.

There are several ways of obtaining this goal. Perl uses a bitmap to indicate membership of a range: for each character in the character set there is a bit in the bitmap. To decide membership, the bit corresponding to the character is looked up. RE2 uses a balanced binary tree: each node in the tree corresponds to either a whole range or a literal character, and the tree is then searched when deciding membership. Each method has its advantages and drawbacks. The bitmap is of constant size, so for small character classes it will be unnecessarily large, but the time to look up a value in the bitmap is also constant and very fast. The balanced binary tree has its advantages for character classes with few ranges and literal characters, since it will then be small and fast to search. The drawback is of course that both its size and its lookup time grow with the character class.

For this project an even simpler solution was chosen: a simple linked list of ranges. Literal characters are represented as ranges of length one. In other words, we have one linked list per character class, and the number of elements in each linked list is the number of literals and ranges in the character class. In the worst case we have to look through all members of a linked list to decide membership of a character class. This is simplistic, but sufficient.
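To make the chosen representation concrete, the following is a minimal sketch in C of a linked list of ranges with a linear membership test. The name is_in_range is the one used for the membership function elsewhere in this thesis; the struct layout, the helpers range_push and range_free, and all signatures are assumptions made for illustration and are not taken from the prototype's source.

```c
#include <assert.h>
#include <stdlib.h>

/* One node per range; a literal character c is stored as the range [c, c]. */
struct range {
    unsigned char lo;
    unsigned char hi;
    struct range *next;
};

/* Prepend the range [lo, hi] to a list and return the new head.
 * Returns NULL on allocation failure (the old list is then left untouched). */
struct range *range_push(struct range *head, unsigned char lo, unsigned char hi)
{
    struct range *r;

    assert(lo <= hi);
    r = malloc(sizeof *r);
    if (r == NULL)
        return NULL;
    r->lo = lo;
    r->hi = hi;
    r->next = head;
    return r;
}

/* Linear scan: returns 1 if c falls inside any range in the list, 0 otherwise. */
int is_in_range(const struct range *head, unsigned char c)
{
    for (; head != NULL; head = head->next)
        if (head->lo <= c && c <= head->hi)
            return 1;
    return 0;
}

/* Release every node of the list. */
void range_free(struct range *head)
{
    while (head != NULL) {
        struct range *next = head->next;
        free(head);
        head = next;
    }
}
```

With this layout a class such as [a-c_] becomes a two-element list, and the cost of a lookup is proportional to the number of ranges and literals in the class, as described above.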

4.2 The simulator

We have built the NFA and the next step is to simulate it. This requires keeping track of a set of active states. In a basic implementation of the Thompson simulation algorithm [13], a state is only added to the active set once. It is important to note that this throws away matches, precisely because a state is added to the active set only once. For example, when the regular expression a|a is matched with the string a there are two possible routes through the NFA, but only one will be reported, since the final state will only be added once to the set of active states.

There are however at least two good reasons why you should not add a state more than once:

• Unless you are careful this will give rise to infinite loops in the simulation process. More on this below.

• We open up for an exponential worst-case behavior. A good example is the same as the backtracking engine worst case: a?ⁿaⁿ matched with aⁿ.

Figure 12: ε-cycles: (a) ()*, (b) (()())*, (c) (|)*.

The problems with infinite loops arise when matching with regular expressions like ()*, see figure 12(a) for the NFA. The simulator will go into an infinite loop, generating these bit-values:

=0=0=0=0=0=0=0=0=0=0=0=0=0=0=0=0=0=0=0...

This is because there is a cycle of ε-transitions in the NFA. This is not desirable behavior, and we would need to stop the simulation before it goes into an infinite loop. Note that this is not implemented, as the problems with infinite loops do not arise in a standard Thompson simulation.

From [2] we have the depth-first search (DFS) algorithm. This algorithm can be modified to detect cycles. In short, it works by initially marking all vertices white. When a vertex is first encountered it is marked gray, and when all its descendants have been visited it is marked black. If a gray vertex is encountered, then we have a cycle and do not need to explore further on this path. The algorithm terminates when all vertices are black. It will terminate, as we color one vertex in each step and we always color the vertices darker.

To see why this algorithm detects cycles, suppose we have a cycle containing vertex a. Then a is reachable from at least one of its descendants. When we reach a from this descendant, it will still be colored gray, since we are not done exploring a’s descendants. Thus the cycle is detected.

Our problem is slightly different: we need to detect whether we are in a cycle of ε-transitions. The DFS solution is still applicable, with slight modifications, as we do a depth-first search when we explore the ε-edges. There will be no white states. Instead we have a counter that is incremented every time a character is read from the input string. Every time a state is encountered it is stamped with the counter. We can only trust the color of a state if the counter and the stamp are identical; a state with an outdated stamp is treated as unvisited. The gray and the black states work in much the same way as before.

In figure 12 we have some of the NFAs we encounter. Figure 12(b) is an example of a long cycle; more parentheses add more ε-transitions. Figure 12(c) shows how more channels can be created in the loop by adding alternations.

Generating the mixed bit-values is a minor adjustment to the NFA simulation. Whenever we perform an action that needs recording, we just record it.
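The stamped variant of the DFS coloring can be sketched as follows. Recall that this cycle detection is not part of the prototype, so everything here is hypothetical: the state layout with at most two ε-successors, the function name explore_epsilon, and the convention that the counter starts at 1 so that a stamp of 0 means "never visited".

```c
#include <assert.h>
#include <stddef.h>

enum color { GRAY, BLACK };

/* Hypothetical state layout: up to two outgoing ε-edges plus DFS bookkeeping. */
struct state {
    struct state *eps[2];   /* ε-successors, NULL when absent */
    unsigned long stamp;    /* value of the input counter at the last visit */
    enum color color;       /* only meaningful when stamp equals the counter */
};

/*
 * Depth-first walk over ε-edges for the input position `counter`.
 * Returns the number of times a gray state was reached again, i.e. the
 * number of ε-cycles closed during this walk. A state whose stamp differs
 * from `counter` plays the role of a white vertex.
 */
int explore_epsilon(struct state *s, unsigned long counter)
{
    int cycles = 0;
    size_t i;

    assert(counter != 0);   /* stamp 0 is reserved for "never visited" */
    if (s == NULL)
        return 0;
    if (s->stamp == counter)
        return s->color == GRAY ? 1 : 0;   /* gray: cycle; black: already done */

    s->stamp = counter;
    s->color = GRAY;
    for (i = 0; i < 2; i++)
        cycles += explore_epsilon(s->eps[i], counter);
    s->color = BLACK;
    return cycles;
}
```

Incrementing the counter for the next input character implicitly resets every state to "white" without touching the states themselves, which is the point of the stamp.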

4.3 Filters

When taking a closer look at how the filters should be implemented in practice, there are some interesting considerations. These are detailed below.

4.3.1 Groupings

As we described in section 3.3.3 on page 22, this is the filter that should (more or less) throw away any mixed bit-values not generated in a capturing parenthesis. In order to do this we need to know which values are generated in a capturing parenthesis and which are not. We look to Laurikari [10] for inspiration. We will be using an NFA augmented with extra ε-transitions. The extra transitions are used to mark the beginning and end of a capturing parenthesis. We use the mixed bit-values to navigate the NFA, and whenever we are inside a capturing parenthesis we copy the mixed bit-values to output.

We rewrote the regular expression to allow for capturing under alternation. We need to insert a 1 when a group participates and a 0 when it does not. When exiting the upper arm of an alternation we need to know how many top-level capturing groups there are in the lower arm, and when entering the lower arm we need to know how many top-level capturing groups there are in the upper arm. Again we solve this problem by augmenting the NFA. We insert the extra information in the split-state marking the entrance to an alternation and add an extra state at the end of the upper arm.

We adopt a similar strategy to solve the problem of reporting only the first match in capturing under a quantifier. We again augment the NFA with the necessary information. A state is inserted at the end of the quantifier, so that this state is the last state that is met in an iteration of the quantifier. When we pass this state and do another iteration, we know that we have already been there at least once and should not output any more bit-values.

We could also have solved the problem of keeping track of how many times we have matched a quantifier by simply rewriting the regular expression. For example, we would rewrite (a)* to (|a)a*. This was dropped because it cannot be done easily on the fly by the NFA generator. The fragment formed by (a) could no longer be considered a finished fragment that is just plugged into the rest. We use it both with and without the capturing parenthesis in the rewrite and would therefore need to open up the fragment and remove the capturing parenthesis for parts of the rewrite.

4.3.2 Trace

We will limit this filter to only output one channel with a match. In the current system this is not actually a limitation: as we are using Thompson’s method for matching, there will only ever be one channel with a match.

All channels are read and the bit-values are saved separately. Every time we read a channel-split operator, we have to allocate a new chunk of memory and copy the bit-values we have accumulated up to this point. When a chunk of memory becomes too small, we enlarge it to a chunk twice the size. When we reach the end of the input stream, we know whether there is a match and are able to output the bit-values that make up the match.

4.3.3 Serialize

This is the filter that outputs what is matched. We need the regular expression that matches the bit-values in order to form the NFA. The NFA is traversed using the bit-values. As we go along, we output the symbols the transitions are marked with and the escaped symbols in the bit-values. The output is what was matched, as a string.


5 Optimizations


The program as described in section 4 on page 27 is unoptimized and written for readability and simplicity. This section deals with potential and realized optimizations. In some cases it is necessary to make a choice: optimizing one aspect (e.g. run time) can incur a cost in another aspect (e.g. memory consumption), or vice versa. We must also differentiate between optimizations at the design level, at the source code level and at even lower levels. Design changes often involve changing the underlying algorithms, while changes at the source code level will typically have a less drastic effect given a sufficiently large problem size, but their impact on constant costs can be significant. This section discusses both of these types of optimizations.

Our binaries were compiled with gcc version 4.4.5 and the following flags: -O3 -march=i686. See section A on page 74 for the specifications of the test computer.

5.1 Finding out where

It is often difficult to simply guess where we might gain potential performance benefits. We can make a theoretical analysis of the algorithms involved, but this does not cover unexpected constant costs (an overly expensive system call performed in every iteration of a loop, for example) or potential errors in the implementation. Therefore, the first step in optimizing should be running an analysis on the implementation. To this purpose we have set up a few experiments to determine the actual memory usage and runtimes of the programs. In all the following experiments, the first program in our pipe, main, is called with a regular expression that matches and captures everything, (.*), and about 114KB of text generated by lipsum.com to be matched.

5.1.1 Memory usage

In table 1 we have the peak memory usage charted. These numbers were collected with valgrind using the parameters --tool=massif --stacks=yes. These parameters mean that we collect information on both stack and heap usage. We can see from the table that all programs except trace use a negligible amount of memory. The Perl script in section D on page 78 was used for collecting the data in this paragraph.

5.1.2 Output sizes

The sizes of the output from the previous experiment on memory usage are listed in table 2. We can see that there is a big size difference between the input to main and its output; the output is 9 times bigger than the input. The same goes for the output of groupings all. In this case, it is because there is nothing to be removed by this filter with this particular regular expression, which can be considered a worst-case scenario for this particular filter. The output of trace is 3 times bigger than the original string of text.

Table 1: Peak memory usage

Program        Size (KB)
main           3.43
groupings all  3.43
trace          1500.16
serialize      3.43
ismatch        3.43

Table 2: Sizes of output

Program        Size (KB)
main           1022.5
groupings all  1022.5
trace          340.8
serialize      113.6
ismatch        0

5.1.3 Runtimes

We made a small experiment to measure the runtime of each program with the help of the Perl script in section D on page 79. We used the regular expression (.*) and a log file of suitable size as inputs. Table 3 shows the runtimes for the different programs. Note that trace takes more than 5 hours to complete (the profiler was disabled when recording this).

5.1.4 Profiling

We need to analyze the runtime behavior of the programs to see which functions take up the runtime. For this purpose we used an external tool, a profiler, of which there are many to choose from. We chose gprof, as it can give us an overview of which functions are called and how much time is spent in each.

Table 3: Runtimes

Program        Runtime (s)
main           1.010
groupings all  2.168
trace          18762.863
serialize      0.443
ismatch        0.773

Figure 13: Output of gprof running main. (Bar chart of % time per function: write_bit, addstate, step, is_in_range, match, other.)

The programs were compiled and linked with the -pg option to enable profiling data to be collected for gprof. We need longer runtimes to get better results from gprof, so instead of the text from lipsum.com we used a log file of suitable size. The size was chosen to be small enough to fit in memory, but big enough to produce longer runtimes. Section D on page 81 contains the Perl script used for profiling.

In figures 13, 14, 15 and 16 we have the column marked % time from the flat profile output of gprof. This column gives the percentage of the total running time used by each function. Only functions taking up more than 5% of the total runtime are included; the rest are grouped together under other.

main In figure 13 we have the values for running main. The functions addstate, step, is_in_range and match are all called in the process of simulating the NFA; together this takes up about 65% of the total runtime. The rest is taken up by I/O, in the function write_bit. For this example, the process of creating the NFA is near instantaneous. We can also see the penalty for choosing a simple solution to the character class problem: the function for deciding membership, is_in_range, takes up about 9% of the total runtime.

groupings all In figure 14 we have the values for running groupings all. The main loop function, read_mbv, takes up about 40% of the total runtime. We also see a large share of the runtime taken up by functions that each account for only a small amount; these are helper functions to the main loop and functions to do with keeping track of the channels. Again the I/O functions write_bit and read_bit take up a fair amount of the runtime, about 23%.

Figure 14: Output of gprof running groupings all. (Bar chart of % time per function: read_mbv, write_bit, read_bit, follow_epsilon, other.)

Figure 15: Output of gprof running trace. (Bar chart of % time per function: read_mbv, channel_copy, channel_write_bit, other.)

trace In figure 15 we have the values for running trace. The main loop function read_mbv takes up more than half of the total runtime. We spend a lot of time copying, writing to, appending to and freeing channels, about 40% of the total runtime. Functions prefixed with channel_ deal with channel management. Here the I/O functions read_bit and write_bit take up relatively little of the runtime, about 5% in total.

serialize In figure 16 we have the values for running serialize. Again the main loop function takes up a lot of time, about 47% of the total runtime. The I/O functions read_bit and write_bit take up about a third of the total runtime.

Figure 16: Output of gprof running serialize. (Bar chart of % time per function: read_bv, read_bit, follow, write_bit, other.)

5.2 Applying the knowledge gained

In the previous section we identified a few trouble spots; in this section we discuss potential remedies.

I/O From the output of gprof we can see that a lot of our runtime, in most programs, is spent in the two I/O functions read_bit and write_bit.

trace trace takes too long to complete. We believe that a good place to start would be to look for alternative channel management techniques.

Main loops A lot of the runtime is spent in the main loops. This is not necessarily a good place to start optimizing; it could just as well be because we were not good at spreading out the workload over smaller functions.

5.2.1 ε-lookahead

Each ε-transition is marked with a look-ahead symbol. The ε-transition can then only be taken if the look-ahead symbol matches the next character in the input string. This optimization has a double effect: it reduces the number of states in the active set when simulating the NFA, and it reduces the mixed bit-values output, at the cost of extra memory for, and time spent constructing, the NFA.

This optimization has not been implemented.

Table 4: Frequencies of operators in a mixed bit-value string

Operator  Match all (a)  Match words (b)
0         11%            13%
1         11%            13%
:         22%            27%
|         11%            2%
=         11%            13%
\         11%            8%
b         11%            13%
t         0%             0%

(a) .*
(b) (?:(?:(?:[a-zA-Z]+ ?)+[,.;:] ?)*..)*

5.2.2 Improved protocol encoding

The protocol used for transmitting data is text-based. Using a binary protocol would make the content more terse, but also nigh impossible for a human to read.

To transmit one operator in the text-based protocol we always use 8 bits. Since we only have 8 different operators, this can be done using fewer bits. A widely used and effective technique for lossless compression of data is Huffman coding [2]. To encode our data efficiently with Huffman codes we need to analyze our data; we need to know the frequency with which the operators appear. In table 4 we have some frequencies of the operators using different regular expressions on a text file of size 114KB generated by www.lipsum.org. We have a simple and a somewhat complex example. Since this is a table of operator frequencies, we have left out the escaped characters, which means the numbers will not sum to 100. The escaped characters have the same frequency as the escape operator.

In figures 37 on page 75 and 38 on page 76 we have the corresponding Huffman trees; they yield the encoding in table 5. We observe that as the regular expression gets more complicated, we use | and \ less and 0, 1, :, = and b more. This is also reflected in the Huffman encoding. Since most regular expressions will be more complex than .*, we choose the encoding in the match words column. This gives us a compression ratio of 0.39 for the match words case and a compression ratio of 0.44 for the match all case.
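As a rough check, the two ratios can be reproduced from tables 4 and 5 under two assumptions that the text does not state explicitly: that the compression ratio is compressed size divided by original size, and that every escaped character is still transmitted as a full 8 bits. The code lengths are those of the match words column of table 5, and the escaped characters contribute the same frequency as the \ operator.

```latex
\begin{align*}
\text{match words:}\quad
  &\frac{0.13\cdot 3 + 0.13\cdot 3 + 0.27\cdot 2 + 0.02\cdot 5
         + 0.13\cdot 3 + 0.08\cdot 4 + 0.13\cdot 2 + 0.08\cdot 8}
        {(0.89 + 0.08)\cdot 8}
   = \frac{3.03}{7.76} \approx 0.39 \\[4pt]
\text{match all:}\quad
  &\frac{0.11\cdot(3 + 3 + 5 + 3 + 4 + 2) + 0.22\cdot 2 + 0.11\cdot 8}
        {(0.88 + 0.11)\cdot 8}
   = \frac{3.52}{7.92} \approx 0.44
\end{align*}
```

In both numerators the last term is the escaped characters at 8 bits each; the denominators charge 8 bits for every operator and every escaped character in the text-based protocol. The totals fall slightly short of 100% because the frequencies in table 4 are rounded.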

Table 5: Huffman encoding

Operator  Match all (a)  Match words (b)
0         1110           010
1         010            110
:         00             10
|         011            01110
=         100            111
\         101            0110
b         110            00
t         1111           01111

(a) .*
(b) (?:(?:(?:[a-zA-Z]+ ?)+[,.;:] ?)*..)*

Table 6: Runtimes for different buffer sizes

Buffer size (B)  Runtime (s)
0                8.786
2                4.507
4                2.559
8                1.354
16               0.886
32               0.618
64               0.437
128              0.391
256              0.338
512              0.320
1024             0.313
2048             0.314
4096             0.310
8196             0.324

We cannot make assumptions beforehand as to the frequencies of the escaped characters, as they depend on the text being matched. If the escaped characters all have the same frequency, then the Huffman method can achieve no compression. Instead we can look to the character class that generated the escaped character. By rewriting the character class with the | operator, creating the corresponding partial NFA as a balanced tree and using this tree as we would a Huffman tree, we can compress the escaped characters. How efficient this method is depends on how many characters are matched by the character class. For example, if we match all characters, then no compression would be obtained; if on the other hand we had a small character class like [a-d], we could encode the characters matched by this class using only 2 bits.

We did not implement this optimization.
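The number of bits needed per escaped character under this scheme is the depth of a balanced binary tree with one leaf per character in the class. A small sketch of that calculation follows; the function name bits_for_class is hypothetical, and since the optimization was not implemented this only illustrates the arithmetic.

```c
#include <assert.h>

/* Bits needed to select one of `members` leaves in a balanced binary tree,
 * i.e. the smallest b with 2^b >= members. */
unsigned bits_for_class(unsigned members)
{
    unsigned bits = 0;

    assert(members >= 1);
    while ((1u << bits) < members)
        bits++;
    return bits;
}
```

For [a-d] this gives 2 bits, and for a class matching all 256 byte values it gives 8 bits, which is why no compression is obtained in that case.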

Table 7: Runtimes for different buffer sizes using non-thread-safe functions

Buffer size (B)  Runtime (s)
0                8.597
2                4.355
4                2.288
8                1.240
16               0.716
32               0.463
64               0.308
128              0.248
256              0.204
512              0.181
1024             0.177
2048             0.164
4096             0.151
8192             0.162

5.2.3 Buffering input and output

The two functions called when doing I/O are fputc and fgetc. These are declared in the stdio.h header file. Reading up on the two reveals that they are buffered and thread-safe. There is even a function for manipulating the buffering method: setvbuf. We did a bit of experimenting with setvbuf; the results are shown in table 6. The runtimes in the table are the combined runtimes of all programs, i.e. we measured the runtime for all programs connected with pipes:

echo lipsum | ./main regex | ./groupings_all regex | \
    ./trace | ./serialize regex

regex and lipsum are the same as used in section 5.1 on page 33. We can see that not much is gained from a buffer size exceeding 1024 bytes.
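To illustrate the kind of call involved, the sketch below wraps setvbuf from stdio.h. The wrapper name set_stream_buffer and its convention that a size of 0 means unbuffered are assumptions made for this example; the text does not say how the prototype configured each buffer size. The sketch takes a caller-owned buffer because, at least according to the glibc documentation, only the buffering mode is affected when no buffer is supplied.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Give `stream` a fully buffered stdio buffer of `size` bytes backed by `buf`,
 * or make it unbuffered when size is 0 or buf is NULL.
 * Should be called before any other operation on the stream, and `buf` must
 * stay valid until the stream is closed.
 * Returns 0 on success and nonzero on failure, as setvbuf does.
 */
int set_stream_buffer(FILE *stream, char *buf, size_t size)
{
    assert(stream != NULL);
    if (size == 0 || buf == NULL)
        return setvbuf(stream, NULL, _IONBF, 0);
    return setvbuf(stream, buf, _IOFBF, size);
}
```

A filter would typically call this once for stdin and once for stdout at the top of its main function, each with its own static buffer, before the first fgetc or fputc.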

Thread-safety We also looked into thread safety. There are non-thread-safe variants of fputc and fgetc: fputc_unlocked and fgetc_unlocked, respectively. Note that the man pages do state that these thread-unsafe functions probably should not be used. The results from an experiment similar to the previous one, only using the non-thread-safe functions for I/O, are in table 7. The non-thread-safe functions are on average 0.16 seconds faster.

5.2.4 Channel management in trace

The problem with trace is that we do not know which channel has a match, so we need to keep track of the bit-values on all of them. If we instead read the mixed bit-values backwards, the first character we read on a channel is the t or b operator. This does require us to read the whole string of mixed bit-values and reverse it. Because this filter is already non-streaming, reading and storing the whole string of mixed bit-values is not a problem.


This optimization is implemented. Running the experiments determining runtime and memory usage for the improved trace gives us a runtime of just 1.309 seconds and a memory usage of 1536KB. Comparing this to the values for the old trace, we see a huge gain in runtime and a slight increase in memory consumption. This optimization is well worth it.

We would expect no performance gain from this optimization when the regular expression consists of a string of literals, that is, when we only create one channel when simulating the NFA. This is however not a very useful application of regular expressions; a simple string compare would suffice.


6 Analysis of algorithms


In this section we analyze the complexity of the various programs and filters. The analysis is presented as tables of functions, each with a short description of what the function does, which other functions it calls and whether or not this is done in a loop. Most importantly, the tables include the complexity of each function, for both run time and storage requirements. The analysis is a worst-case analysis; in some situations this outcome is unlikely barring a pathological input. Straightforward cases are not commented on, but some entries in the tables require a more detailed description. These are given where necessary. Each component is summarized based on the contents of the function tables.

In the following sections n denotes the length of the regular expression and m the length of the input string.

6.1 Constructing the NFA

In tables 8, 9 and 10 we have an overview of the functions involved in constructing an NFA. We have the theoretical upper bound on run time and memory consumption in big-O notation. Most functions are straightforward, but a few deserve a comment.

ptrlist_patch is the function that patches a list of dangling pointers to a state. The pointers are only dangling once in their lifetime and do not become dangling again once patched to a state. The upper bound on the total number of dangling pointers is two times the number of states, and the state count is linear in the size of the regular expression, making the upper bound on the number of pointers linear in the size of the regular expression as well. The total amount of work is therefore linear in the size of the regular expression. The number of times ptrlist_patch is called is also linear in the size of the regular expression. The amortized cost of calling ptrlist_patch is therefore O(1).

ptrlist_free: see the argument for ptrlist_patch.

cc2fragment also has a loop over the regular expression. The reason why this does not become an O(n²) operation is that the loop counter is shared between the two functions, cc2fragment and its caller re2nfa. Any progress made by one function in the regular expression is shared with the other.

We note that all functions called by the main loop function re2nfa are O(1) in both run time and memory, apart from cc2fragment; see the comment above. re2nfa allocates a stack for use in the construction process; the number of elements in the stack is the number of characters in the regular expression. We call the functions a number of times that is linear in the size of the regular expression, and we therefore conclude that constructing an NFA is O(n) in both run time and memory consumption.



Table 8: Analysis of re2nfa (part 1)

Function         Description                              Run time  Memory
state            Allocates memory for states              O(1)      O(1)
fragment         Assigns values for a fragment            O(1)      O(1)
ptrlist_list1    Allocates pointer list of one            O(1)      O(1)
ptrlist_patch    Patches list of pointers                 O(1) (a)  O(1)
ptrlist_append   Appends two lists of pointers            O(1)      O(1)
ptrlist_free     Frees list of pointers                   O(1) (a)  O(1)
read_paren_type  Determines parenthesis type              O(1)      O(1)
range            Allocates memory for a range             O(1)      O(1)
parse_cc_char    Parses a character in a character class  O(1)      O(1)

(a) Amortized cost

6.2 Simulating the NFA

In table 11 we have the overview of the functions involved in the simulation of the NFA. We have the theoretical upper bound on run time and memory consumption in big-O notation. Some functions deserve a comment.

addstate and last_addstate mark the states they have already visited in one step. Upon encountering a state that has already been visited, they return. This means that the total amount of work in a single step is O(n) and not O(n²).

is_in_range is also called once per active state in step and last_step. We only create one state per character class; instead we create a list of ranges representing the character class. The more ranges, the fewer states, and vice versa. This (again) means that the total amount of work in a single step is O(n) and not O(n²).

We note that for each character in the input string, in the worst case we have to visit all states: O(n ∗ m). We allocate enough space for the NFA, which is O(n), and two lists to keep track of the active states, which are also O(n), resulting in a total of O(n).

6.3 The ’match’ filter

Since this filter is very simple (we only call write_bit and read_bit once per input character), we have elected to just provide a textual summary here. This filter has run time linear in the length of the input and a memory consumption of O(1).





Table 9: Analysis of re2nfa (part 2)

Function           Description                                         Run time  Memory
cc2fragment        Makes a fragment of a character class. Calls        O(n)      O(1)
                   state and fragment sequentially and
                   parse_cc_range in a loop over the regular
                   expression
parse_cc_range     Parses a range in a character class. Calls          O(1)      O(1)
                   range and parse_cc_char sequentially
maybe_concat       Concatenates top two fragments. Calls               O(1)      O(1)
                   ptrlist_patch and ptrlist_free sequentially
maybe_alternate    Alternates top fragments. Calls state,              O(1)      O(1)
                   fragment, ptrlist_list1 and ptrlist_append
                   sequentially
do_right_paren     Does a right parenthesis. Calls maybe_concat,       O(1)      O(1)
                   maybe_alternate, state, fragment and
                   ptrlist_list1 sequentially
finish_up_regex    Finishing touches on forming the NFA. Calls         O(1)      O(1)
                   state, maybe_concat, maybe_alternate,
                   ptrlist_append, ptrlist_list1, ptrlist_patch
                   and ptrlist_free sequentially
do_quantifier      Does quantifier. Calls state, ptrlist_patch,        O(1)      O(1)
                   fragment, ptrlist_list1 and ptrlist_append
                   sequentially





Table 10: Analysis of re2nfa (part 3)

Function           Description                                         Run time  Memory
re2nfa             Main loop. Calls state, do_quantifier,              O(n)      O(n)
                   maybe_concat, maybe_alternate, fragment,
                   do_right_paren and cc2fragment looping over the
                   regular expression, and finish_up_regex
                   sequentially

Table 11: Analysis of match

Function           Description                                         Run time  Memory
is_in_range        Determines if a character is accepted by a          O(n)      O(1)
                   character class
read_bit           Reads character from input                          O(1)      O(1)
write_bit          Outputs bit                                         O(1)      O(1)
last_addstate      Add a state with final character in input           O(n)      O(1)
                   string. Calls write_bit and last_addstate
last_step          Advance simulation final character. Calls           O(n)      O(1)
                   last_addstate, is_in_range and write_bit in a
                   loop over active states
addstate           Add a state. Calls write_bit and addstate           O(n)      O(1)
step               Advances simulation one character. Calls            O(n)      O(n)
                   addstate and write_bit in a loop over active
                   states
match              Matches an NFA with a string                        O(m ∗ n)  O(n)



6.4 The ’groupings’ filter

In table 12 we have the overview of the functions in the ’groupings’ filter; o is the input length.

follow_epsilon follows all possible transitions for a channel. The number of possible transitions is bounded by the number of actual transitions, O(n). It is called for (almost) every character. This leads us to the overall run-time complexity for this filter: O(n ∗ o). The memory used is the memory for the NFA and the list of active channels; we cannot have more active channels than states, so this is O(n).

6.5 The ’trace’ filter

In table 13 we have the overview of the functions in the ’trace’ filter; o is the input length. The version of the ’trace’ filter analyzed here is the optimized one.

channel_write_bit reallocates memory when it runs out; the allocation strategy is to double the amount of memory every time we run out. The cost of copying the data over all calls to channel_write_bit is 1 + 2 + 4 + 8 + ... + o = O(o). Because this is the cost of all calls to channel_write_bit taken together, the cost of trace is only O(o) and not O(o²).

This filter has O(o) complexity in both run time and memory.
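The doubling argument can be made tangible with a small sketch that counts how many bytes the reallocations may have to move. The function name channel_write_bit is taken from table 13, but the struct layout, the signature, the extra copied counter and the one-byte-per-bit-value storage are assumptions made for this illustration only.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable channel buffer; one byte per stored bit-value. */
struct channel {
    char *data;
    size_t len;      /* bytes in use */
    size_t cap;      /* bytes allocated */
    size_t copied;   /* bytes the reallocations may have moved so far */
};

/* Append one bit-value, doubling the allocation when it is full.
 * Returns 0 on success and -1 on allocation failure. */
int channel_write_bit(struct channel *ch, char bit)
{
    assert(ch != NULL);
    if (ch->len == ch->cap) {
        size_t new_cap = ch->cap ? 2 * ch->cap : 1;
        char *p = realloc(ch->data, new_cap);
        if (p == NULL)
            return -1;
        ch->copied += ch->len;   /* worst case: realloc moves everything stored */
        ch->data = p;
        ch->cap = new_cap;
    }
    ch->data[ch->len++] = bit;
    return 0;
}
```

Starting from an empty channel and writing 1000 bit-values, the reallocations move at most 1 + 2 + 4 + ... + 512 = 1023 bytes in total, which stays below twice the number of bit-values written, in line with the O(o) bound above.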

6.6 The ’serialize’ filter

In table 14 we have the functions in the ’serialize’ filter; o is the input length. This is another uncomplicated filter. We basically just read the input and step through the NFA: O(o ∗ n). The only memory we allocate is for the NFA: O(n).



Table 12: Analysis of the ’groupings’ filter

Function           Description                                         Run time  Memory
channel            Allocates memory for channel                        O(1)      O(1)
channel_copy       Copies a channel                                    O(1)      O(1)
channel_free       Frees a channel                                     O(1)      O(1)
channel_remove     Removes a channel from the channel list.            O(1)      O(1)
                   Calls channel_free
follow_epsilon     Follow all possible transitions from current        O(n)      O(1)
                   state. Calls write_bit
do_end             Channel has ended. Calls channel_remove and         O(1)      O(1)
                   write_bit
do_split           Splits the channel. Calls channel_copy              O(1)      O(1)
do_one             Channel takes the transition marked 1. Calls        O(n)      O(1)
                   write_bit and follow_epsilon
do_zero            Channel takes the transition marked 0. Calls        O(n)      O(1)
                   write_bit and follow_epsilon
do_escape          Handles escaped character in input. Calls           O(n)      O(1)
                   follow_epsilon and write_bit
read_mbv           Main loop. Calls read_bit, write_bit, do_split,     O(n ∗ o)  O(n)
                   do_end, do_zero, do_one and do_escape in a loop
                   over the input, and follow_epsilon
main               Calls re2nfa and read_mbv                           O(n ∗ o)  O(n)



Table 13: Analysis of the ’trace’ filter

Function           Description                                         Run time  Memory
channel_write_bit  Writes bit in channel. Reallocates memory if        O(o)      O(o)
                   necessary
trace              Reads mixed bit-values backwards and stores the     O(o)      O(o)
                   channel containing a match. Calls
                   channel_write_bit in a loop over the mixed
                   bit-values
read_mbv           Reads whole input string into memory,               O(o)      O(o)
                   reallocates memory if it runs out. Calls
                   read_bit in a loop over input
reverse            Reverses input string                               O(o)      O(o)
main               Calls read_mbv, reverse, trace and write_bit        O(o)      O(o)

Table 14: Analysis of the ’serialize’ filter

Function           Description                                         Run time  Memory
follow             Follows all legal transitions. Calls write_bit      O(n)      O(1)
read_bv            Reads bit-values from input. Calls read_bit,        O(o ∗ n)  O(1)
                   write_bit and follow in a loop over input
main               Calls re2nfa and read_bv                            O(o ∗ n)  O(n)


7 Evaluation


In this section we look at how our programs compare to other implementations. We have chosen a few languages and libraries that we feel are interesting:

RE2 RE2 is a new open source library for C++ written by Russ Cox. It is only a little more than a year old. It uses automata when matching and does not offer back-references.

Tcl Tcl added regular expression support in a release in 1999. The regular expression engine is written by Henry Spencer and is a hybrid engine. Tcl is an interpreted language.

Perl Perl is from 1987 and is written by Larry Wall. It uses backtracking and virtual machines when matching. It is an interpreted language.

They are few in number, but they cover the basics in underlying technology and performance.

The benchmarks here cannot be considered exhaustive; instead we have tried picking a few that show interesting features of our programs. Our implementation is denoted Main in the graphs.

Input method Our chosen method of input has a drawback, namely the upper limit on the size of command line input. The system we tested on, see also section A on page 74, has an upper limit on command line input of 2MB. This can be ascertained (on our test system at least) by issuing the following command:

$ getconf ARG_MAX2097152

In the unlikely event that a user will need to match with a regular expres-sion exceeding this 2MB limit, there is always the option to use a file in-stead. Files only suffer the limit that they, along with the intermediate datagenerated by this solution, need to fit in the free virtual memory space.
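The same limit can also be queried from a program. The following is a minimal sketch, assuming a Unix-like system where Python's os.sysconf knows the POSIX name SC_ARG_MAX; the function name arg_max_bytes is ours and not part of the thesis code.

```python
import os


def arg_max_bytes() -> int:
    """Return the limit that `getconf ARG_MAX` prints, in bytes."""
    # os.sysconf only exists on Unix-like systems; "SC_ARG_MAX" is the
    # POSIX configuration name behind ARG_MAX.
    return os.sysconf("SC_ARG_MAX")


if __name__ == "__main__":
    print(arg_max_bytes(), "bytes")
```

The value differs between systems, so a script that generates very long regular expressions may want to check it before choosing between command line input and file input.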

7.1 A backtracking worst-case

Our first benchmark is taken from [4] and demonstrates the worst-case behavior of the backtracking algorithm. Using superscripts to denote string repetition, we will be matching a?^n a^n with the string a^n. For example, a?^3 a^3 translates to a?a?a?aaa. We expect Perl to do poorly in this benchmark, while the rest should do well.

For this experiment we used the programs main and ismatch. The scripts used for the backtracking worst case are in sections E on page 81 and E on page 84.
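As an illustration of how the benchmark input is built, the sketch below generates the regular expression and the string for a given n. It is not the benchmark script from section E; the function name is hypothetical, and Python's re module is only used as a convenient checker.

```python
import re


def backtracking_worst_case(n: int) -> tuple:
    """Return (regex, text) for a?^n a^n matched against a^n."""
    regex = "a?" * n + "a" * n
    text = "a" * n
    return regex, text


if __name__ == "__main__":
    regex, text = backtracking_worst_case(3)
    print(regex, text)
    # Keep n small here: Python's re module is itself a backtracking engine.
    print(re.fullmatch(regex, text) is not None)
```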

Figure 17: A backtracking worst-case: Main, Tcl and RE2 runtimes. (Horizontal axis: n; vertical axis: seconds; series: main, re2, tcl.)

Runtimes

The runtimes can be seen in figures 17 and 18. As anticipated, Perl exhibits very poor performance. The slope in figure 18, together with the logarithmic scale, suggests that Perl runs in time exponential in n. This is not surprising considering that the ? matches greedily, meaning it will first try to consume a character from the input. The only way that the regular expression matches the string is if all the quantifiers consume no input. There are 2^n possible ways for the quantifiers to consume or not consume a character. Since it matches greedily, the backtracking engine has to search through all 2^n possibilities to find the matching one.

In figure 17 we do not see much difference in the performance of Main, RE2 and Tcl. RE2 stops before the others because it has an upper limit on how much memory it will consume. The limit is user (compile time) defined for RE2. This limit could have been set to a value that would allow RE2 to continue matching along with Main and Tcl. We chose not to do this, because we wanted to demonstrate this feature in RE2. This feature is especially useful in setups where memory is very tight or where you accept regular expressions from untrusted sources, but it can obviously also be considered a nuisance in situations where you do not want to fiddle with this limit, but just want RE2 to do your matches.

As a side note, we found that if we added a literal b at the end of the regular expression, so that the regular expression no longer matches the string, we would suddenly see marked improvements in the runtimes of Perl. Using a backtracking algorithm it would still take time exponential in n to decide that they do not match, but Perl scans the input string for all literals in the regular expression, and it quickly discovers that there is no b in the input string, so the regular expression cannot match the string.
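The counting argument can be checked with a toy matcher. The sketch below is a naive greedy backtracker written only for the pattern a?^n a^n; it is an assumption-laden illustration and makes no claim about how Perl is implemented. It counts how often the matcher is entered, which is at least 2^n because every combination of consume and skip is tried before the successful one.

```python
def count_attempts(n: int) -> tuple:
    """Match a?^n a^n against a^n by naive backtracking.

    Returns (matched, calls), where calls is the number of times the
    matcher was entered.
    """
    text = "a" * n
    calls = 0

    def match(item: int, pos: int) -> bool:
        nonlocal calls
        calls += 1
        if item == 2 * n:                      # all 2n pattern items used up
            return pos == len(text)
        can_consume = pos < len(text) and text[pos] == "a"
        if item < n:                           # one of the n optional a?
            # Greedy: first try to consume a character, then try to skip.
            if can_consume and match(item + 1, pos + 1):
                return True
            return match(item + 1, pos)
        # One of the n mandatory literals.
        return can_consume and match(item + 1, pos + 1)

    return match(0, 0), calls


if __name__ == "__main__":
    for n in (4, 8, 12):
        print(n, count_attempts(n))
```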

Figure 18: A backtracking worst-case: Perl runtime on a logarithmic scale. (Horizontal axis: n; vertical axis: seconds, logarithmic; series: perl.)

Figure 19: A backtracking worst-case: Total, Tcl and RE2 memory usage. (Horizontal axis: n; vertical axis: MB; series: total, re2, tcl.)

Memory usage

Memory usage is depicted in figures 19 and 20. For our program we have added up the memory usage of the individual programs and displayed the sum under the total header.

In figure 19 we again note that RE2 stops before the other two because of memory limitations. The memory usage of our programs and RE2 does not appear to be more than linear in the input size, while Tcl looks more like some quadratic function. It is hard to give a good explanation for this without knowing Tcl's regular expression engine better; even if it did use an NFA for this match, the size should still be linear in the input. Tcl has the same asymptotic runtime performance as Main and RE2, but with bigger constants; all that extra memory spent is not being put to good use.

Figure 20 shows Perl's memory usage. It is hard to say anything useful based on that graph, since the values for n are too small. Unfortunately, due to the exponential run-time of the problem it is not viable to increase them.

Figure 20: A backtracking worst-case: Perl memory usage. (Horizontal axis: n; vertical axis: MB; series: perl.)

7.2 A DFA worst-case

Using superscripts to denote string repetition, constructing a DFA from the regular expression (a|b)*a(a|b)^n results in an exponential blow-up of the state count. For example, (a|b)*a(a|b)^3 translates to (a|b)*a(a|b)(a|b)(a|b). Acceptance with a DFA is decided in time linear in the size of the input string, but we would still have to store the DFA, which takes space exponential in the size of the regular expression. We expect any regular expression engine using DFAs to do poorly on this benchmark. The engines that can be expected to use DFAs are RE2 and Tcl, but both can switch method according to need.

For this experiment we used the programs main and ismatch. The scripts used for the DFA worst case are in sections E on page 85 and E on page 87.
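The blow-up can be observed directly with a small subset construction. The sketch below assumes a hand-built NFA for (a|b)*a(a|b)^n with states 0 to n+1 (state 0 loops on both symbols and moves to state 1 on a; every later state advances on either symbol); this NFA and the function name are ours, not the ones produced by the thesis programs. It counts the DFA states reachable from the start state.

```python
def count_dfa_states(n: int) -> int:
    """Count reachable subset-construction states for (a|b)*a(a|b)^n."""
    start = frozenset({0})
    seen = {start}
    todo = [start]
    while todo:
        subset = todo.pop()
        for symbol in "ab":
            target = set()
            for state in subset:
                if state == 0:
                    target.add(0)              # (a|b)* loop
                    if symbol == "a":
                        target.add(1)          # the distinguished a
                elif state <= n:
                    target.add(state + 1)      # one of the n trailing (a|b)
                # State n + 1 is accepting and has no outgoing transitions.
            frozen = frozenset(target)
            if frozen not in seen:
                seen.add(frozen)
                todo.append(frozen)
    return len(seen)


if __name__ == "__main__":
    for n in range(1, 11):
        print(n, count_dfa_states(n))
```

For this particular NFA the count doubles every time n grows by one, since the DFA has to remember which of the last n+1 symbols were an a.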

Runtimes

Figures 21 and 22 display the runtimes of the various programs. Again we note that RE2 stops before the others, see above. Tcl stands out with significantly lower performance, but none appears to have exponential or worse asymptotic behavior. This would suggest that Tcl chooses to use a DFA in some form for this match and RE2 falls back on something else.

Figure 21: A DFA worst-case: Main, RE2 and Perl runtimes. (Horizontal axis: n; vertical axis: seconds; series: main, perl, re2.)

Figure 22: A DFA worst-case: Tcl runtimes. (Horizontal axis: n; vertical axis: seconds; series: tcl.)

Figure 23: A DFA worst-case: Main and RE2 memory usage. (Horizontal axis: n; vertical axis: MB; series: total, re2.)

Figure 24: A DFA worst-case: Perl memory usage. (Horizontal axis: n; vertical axis: MB; series: perl.)

Memory usage

The memory usage proved to be a more complicated matter than the runtimes, see figures 23, 24 and 25.

Our programs and RE2 appear to be using memory linear in the size of the regular expression. There is a sharp rise in memory consumed by RE2 for n smaller than about 1000. This would indicate that RE2 uses a DFA until the exponential factor becomes too big and forces it to switch method.

Perl's memory usage is mapped in figure 24. Compared to our programs, Perl uses rather a lot of memory. It seems to be increasing in a quadratic manner.

In figure 25 we have Tcl's memory usage mapped. Note the logarithmic scale. Our suspicion that Tcl uses a DFA is confirmed by the memory usage, which appears to be exponential in the size of the regular expression.

Figure 25: A DFA worst-case: Tcl memory usage. (Horizontal axis: n; vertical axis: MB, logarithmic; series: tcl.)

7.3 Extracting an email-address

This is the first of our real world benchmarks. We will be extracting an email-address from a string of text. Since we cannot do partial matches, we will be constructing strings of increasingly long email-addresses. The regular expression is taken from [14]. Unlike the two previous benchmarks, the regular expression is kept constant and does not grow.

Since we are extracting a value, we are using main, groupings_all, trace and serialize for this match. The scripts can be found in sections E on page 88, E on page 90 and E on page 92.

Runtimes

Figure 26 contains the runtimes. All appear to be running in time linear in the length of the input string, with Tcl clearly having the lowest constants and our programs the largest.

Memory usage

In figure 27 we have the memory usage of the programs. All programs except ours seem to be using memory linear in the input string. We seem to be using memory in a stepped manner. This correlates well with our scheme for memory management in trace: We double the amount of memory used every time we run out. This is confirmed by figure 28, which displays the memory usage for the individual programs. Here we see that all our programs except trace use a constant amount of memory. It is hard to tell from figures 27 and 28 if the memory used is linear in the input string. What we need to know is the size of the mixed bit-values compared to the input string; we have displayed this in figure 29, where we have the sizes of the output from the individual programs. We did not put in a separate column for the size of the input string, since this is exactly the same as the size of the output from serialize. There is a linear relationship between the output of groupings_all and serialize: The first is 20 times bigger than the latter. See figure 30. This leads us to the conclusion that we are using memory linear in the size of the input string for this match.

Figure 26: Extracting an email-address: Runtimes. (Horizontal axis: length; vertical axis: seconds; series: main, perl, re2, tcl.)

Figure 27: Extracting an email-address: Memory usage. (Horizontal axis: n; vertical axis: MB; series: total, perl, re2, tcl.)

Figure 28: Extracting an email-address: Individual programs memory usage. (Horizontal axis: n; vertical axis: MB; series: main, groupings_all, trace, serialize.)

Figure 29: Extracting an email-address: Sizes of output from individual programs. (Horizontal axis: n; vertical axis: KB.)

Figure 30: Extracting an email-address: The relationship between input size to trace and size of input string. (Horizontal axis: n.)
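The stepped memory curve follows from the doubling scheme used in trace. The sketch below is a stand-in written for illustration, not the thesis code: the class GrowableBuffer, its initial capacity of 4 and the helper capacity_steps are all assumptions. It records the capacity after each append, which gives the staircase shape.

```python
class GrowableBuffer:
    """Byte buffer that doubles its capacity whenever it runs out of room."""

    def __init__(self, initial_capacity: int = 4) -> None:
        self._data = bytearray(initial_capacity)
        self.length = 0
        self.reallocations = 0

    @property
    def capacity(self) -> int:
        return len(self._data)

    def append(self, value: int) -> None:
        if self.length == self.capacity:
            # Out of room: double the allocation, as a realloc to twice
            # the size would do in C.
            self._data.extend(bytes(self.capacity))
            self.reallocations += 1
        self._data[self.length] = value
        self.length += 1


def capacity_steps(n: int) -> list:
    """Capacity observed after each of n appends."""
    buf = GrowableBuffer()
    steps = []
    for _ in range(n):
        buf.append(1)
        steps.append(buf.capacity)
    return steps


if __name__ == "__main__":
    print(capacity_steps(10))
```

Once the initial allocation has been outgrown, the capacity stays below twice the amount of data stored, so the steps still follow a linear trend.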

7.4 Extracting a number

Our fourth and last benchmark is also a real world example taken from [14]. This one extracts a number from a string. We cannot do partial matches, so again we will be using a string consisting of increasingly large numbers. The regular expression is constant.

We will be extracting a number, so we will be using the programs main, groupings_all, trace and serialize for this match. The scripts can be found in sections E on page 93, E on page 95 and E on page 97.

Runtimes

Figure 31 shows the runtimes for this benchmark. They all appear to be linear in the size of the input string. Our programs clearly have bigger constants than the rest.

Figure 31: Extracting a number: Runtimes. (Horizontal axis: length; vertical axis: seconds; series: main, perl, re2, tcl.)

Figure 32: Extracting a number: Memory usage. (Horizontal axis: n; vertical axis: MB; series: total, perl, re2, tcl.)

Memory usage

In figure 32 we have the memory usage. We clearly see the same pattern as in the email-address benchmark: our programs use memory in a stepped but linear manner, and the rest use memory linear in the size of the input string. To investigate further, see figure 33 for the memory usage of the individual programs. All our programs except trace use a constant amount of memory. The steps in the memory usage of trace are even more pronounced in this figure. See the above note on the memory management strategy in trace for an explanation. The relationship between the sizes of the output from the different programs is displayed in figure 34. The size of the input string is exactly the same as the output from serialize. For greater clarity we have the relationship between the input size to trace and the size of the input string displayed in figure 35. We do not observe the same fixed relationship between the two as we did in the email-address benchmark, but there seems to be no increasing trend. We generate between 20 and 30 mixed bit-values per byte in the input string. We again use memory linear in the size of the input string.

Figure 33: Extracting a number: Individual programs memory usage. (Horizontal axis: n; vertical axis: MB.)

Figure 34: Extracting a number: Sizes of output from individual programs. (Horizontal axis: n; vertical axis: KB.)

Figure 35: Extracting a number: The relationship between input size to trace and size of input string. (Horizontal axis: n.)

7.5 Large files

We also did some experiments involving a large (333.5MB) log file. The log file was lifted from another project where it was used to gather data on file access [16]. The data in the log file was aggregated, one line at a time, using some regular expressions that could easily be translated for use in our regular expression engine. Our experiment consisted of replicating the aggregation of data. This is where our problems with this experiment began: We tried to read the whole file at once, since the current version of our framework has no support for line-by-line reads. After some time this resulted in an out of memory error.

This is not a problem specific to our framework; we observe that this operation in some other existing implementations, such as Perl, also results in an out of memory error.

The lesson of this experiment is that we lack some way of applying a regular expression one line at a time. We could have tried applying the regular expressions with the help of awk, but it seemed superfluous to invoke our pipeline of programs when awk already has a perfectly good regular expression engine built in. Our other choice would have been to write a shell-script using a loop and cat.
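What line-by-line application buys can be sketched as follows. The example uses Python's re module as a stand-in for our pipeline, and the log format and the function name aggregate_lines are made up for the illustration; the actual log file from [16] is not reproduced here. Only one line is held in memory at a time, so memory use does not depend on the size of the file.

```python
import io
import re
from collections import Counter


def aggregate_lines(stream, pattern: str) -> Counter:
    """Count the values of capture group 1 over all lines that match."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in stream:                     # file objects are read lazily
        match = regex.fullmatch(line.rstrip("\n"))
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log format, used only for this illustration.
    log = io.StringIO("open /etc/passwd\nread /tmp/a\nopen /tmp/a\nbogus\n")
    print(aggregate_lines(log, r"(open|read|write) .*"))
```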


7.6 Correctness

We have, to the best of our ability, tested the programs for errors. This was accomplished by creating a database of more than 320 test cases and repeatedly verifying the programs against this database, both during development and during subsequent optimizations and changes.

Unfortunately, it was discovered late in the project that a specific situation is not handled correctly. The problem has been fixed at the design level but is still present in the implementation: using the 'groupings' filter with a quantifier will not always produce correct results. For example, (a)* matched with aaa will produce the output 111 from the groupings_all filter, which is not correct. Part of the problem is that we are not producing the required bit-values to bind the captured bit-values together.

7.7 Conclusion

We observe that our programs do not suffer from either the backtracking worst-case or the DFA worst-case problems.

Correctness We do not produce correct output for all combinations of regular expressions, input strings and filters. See the example in the section above.

Simple acceptance We can compete on an even footing with the best when it comes to a simple acceptance decision. Our main program combined with the 'match' filter is fast and uses memory linear in the size of the regular expression.

Capturing groups When it comes to the more complicated task of capturing the contents of groups, we are lagging behind. The task of capturing groups is achieved by a pipe of four programs in our framework. Even if all of them only used memory linear in the length of the regular expression, the overhead of running separate programs adds up. The 'trace' filter uses memory linear in the size of the input string. While this is obviously not quite as good as the other filters, it is nevertheless a good result: we have managed to separate this functionality into an independent component.

An obvious drawback is the constant basic overhead of having four individual programs instead of one. We note, however, that this is currently completely unoptimized and that this cost will not rise relative to either input. It is also still quite low; it would likely only be a relevant problem in a demanding environment such as embedded devices.

The other main problem is the blow-up of the size of the mixed bit-values. We have previously discussed potential solutions to this problem.

8 Related work

The groundwork of regular expressions has been known for decades. Current research is often focused generally, for example on improving matching speeds, but there is also heavy activity towards more specialized purposes, such as XML parsers, hardware based spam detection or even such fields as biology, where regular expressions are often used to find patterns of amino acids.

The aim of this particular project is towards the general realm (the curious reader may wish to read [12] if they are interested in special usages of regular expressions). For this reason, the remainder of this section will concentrate on likewise general research.

8.1 Constructing NFAs

It is possible to construct different NFAs accepting the same language. These NFAs will have different properties.

8.2 Simulating NFAs

Using the NFA to determine membership is also known as simulating the NFA. There are many different ways of doing this; in the following we describe two alternatives to the method we have used.

8.2.1 Frisch and Cardelli

Frisch and Cardelli present a method to simulate an NFA in their paper [7]. It works in two passes over the input string.

The first pass annotates the input string with enough information to decide which branch to pick in alternations and how many times to iterate a quantifier. This is done by reading the input from right to left and working our way backwards in the NFA; the visited states are annotated and stored with the input string. In the second and main pass, where we read the input string left to right and work our way forwards in the NFA, these annotations are used to decide which branch to take in an alternation and how many times to iterate a star. This method is not suitable for a streaming regular expression engine. The input string is read first from end to front and then from front to end; this cannot be done without storing the string.

8.2.2 Backtracking

Backtracking is a way of simulating an NFA. It is the method employed by many programming languages and libraries, such as Perl and PCRE. Compared to other methods, this has the advantage of allowing backreferences, but it is also a worst-case exponential-time algorithm. An example of a matching that exhibits worst-case behavior is the regular expression a?^n a^n and the string a^n, where superscripts denote string repetition. This example is also known from the article [4].

A backtracking algorithm works depth-first. It has one active state (AS), a string pointer (SP) and a stack of save-points. Every time we have to make a choice as to which transition to take when traversing the NFA, we save the state so we can later return and explore the alternate routes. Each save-point consists of a state and a pointer to the string.

Figure 36: NFA for a* (states 1, 2 and 3; two ε-transitions and one transition on a).

Example 11 (Backtracking). In this example we will be matching a* to the string aa. The NFA we will be using for this example is in figure 36. The process of matching could look like:

AS | SP | Save-points | Explanation
1 | aa | | Initially AS is set to the start state and SP is set to the first character in the string
2 | aa | (1, aa) | Two paths are available, so we save the state and take the ε-transition to state 2.
1 | aa | (1, aa) | We take the transition back to state 1 and consume the first a.
2 | aa | (1, aa) (1, aa) | Two paths are available, so we save the state and take the ε-transition to state 2.
1 | aa | (1, aa) (1, aa) | We take the transition back to state 1 and consume the second a.
2 | aa | (1, aa) (1, aa) (1, aa ) | Two paths are available, so we save the state and take the ε-transition to state 2.
1 | aa | (1, aa) (1, aa ) | No transitions are available from state 2, so this path is abandoned. We backtrack and pop a save-state.
3 | aa | (1, aa) (1, aa ) | The other available ε-transition to state 3 is taken.
1 | aa | (1, aa ) | No transitions are available from state 3, so this path is abandoned. We backtrack and pop a save-state.
3 | aa | (1, aa ) | The other available ε-transition to state 3 is taken.
1 | aa | | No transitions are available from state 3, so this path is abandoned. We backtrack and pop a save-state.
3 | aa | | The other available ε-transition to state 3 is taken. We have reached the end of the string and are in an accepting state: We have a match!

∗
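The procedure in Example 11 can be sketched in a few lines. The transition table below is our reading of the NFA from the example (1 to 2 on ε, 2 to 1 on a, 1 to 3 on ε, with 3 accepting) and is an assumption, as are all names. The sketch uses a conventional last-in-first-out stack, so save-points may be popped in a different order than in the table above, and it would not terminate on an NFA with a cycle of ε-transitions.

```python
# NFA assumed from Example 11. A label of None stands for an ε-transition.
NFA = {1: [(None, 2), (None, 3)], 2: [("a", 1)], 3: []}
START = 1
ACCEPTING = {3}


def backtrack_match(text: str) -> bool:
    """Depth-first simulation of NFA with an explicit stack of save-points."""
    save_points = [(START, 0)]              # entries are (state, string pointer)
    while save_points:
        state, sp = save_points.pop()
        if state in ACCEPTING and sp == len(text):
            return True
        # Push alternatives in reverse so the first transition is tried first.
        for label, target in reversed(NFA[state]):
            if label is None:
                save_points.append((target, sp))
            elif sp < len(text) and text[sp] == label:
                save_points.append((target, sp + 1))
    return False


if __name__ == "__main__":
    for text in ("", "aa", "ab"):
        print(repr(text), backtrack_match(text))
```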

8.3 Virtual machine

Another popular method of matching regular expressions to text is the virtual machine approach [5]. Instead of constructing an automaton, we generate byte-code for an interpreter.

A simple virtual machine would have the ability to execute threads, each thread consisting of a regular expression program. Each thread would maintain a program counter (PC) and a string pointer (SP). A regular expression program could for example consist of the following instructions:

char c If the SP does not point to a c character, then this thread of execution is abandoned. Otherwise, the SP and the PC are advanced.

match Stop thread, we have a match.

jmp x PC is set to x.

split x, y Split the thread of execution. The old thread's PC is set to x and the new thread's PC is set to y.

With these few and simple instructions we are able to compile regular expressions with concatenation, alternation and repetition, see table 15.

Table 15: Code sequences

a        char a

e1e2     codes for e1
         codes for e2

e1|e2    split L1, L2
         L1: codes for e1
         jmp L3
         L2: codes for e2
         L3:

e∗       L1: split L2, L3
         L2: codes for e
         jmp L1
         L3:

Example 12 (Virtual machine). We are now ready for a small example match: we can match the regular expression a* with the string aa. The regular expression compiles to

0 split 1, 4
1 char a
2 jmp 0
4 match

Running this on a virtual machine could look like

Thread | PC | SP | Explanation
T1 | 0 | aa | Create thread T2 with PC set to 4 and SP at aa. T1 continues execution at 1.
T1 | 1 | aa | Character matches, SP and PC are advanced.
T1 | 2 | aa | PC is set to 0.
T1 | 0 | aa | Create thread T3 with PC set to 4 and SP at aa. T1 continues execution at 1.
T1 | 1 | aa | Character matches, SP and PC are advanced.
T1 | 2 | aa | PC is set to 0.
T1 | 0 | aa | Create thread T4 with PC set to 4 and SP at aa . T1 continues execution at 1.
T1 | 1 | aa | Character does not match and this thread is abandoned.
T2 | 4 | aa | We have a match.
T3 | 4 | aa | We have a match.
T4 | 4 | aa | We have a match.

∗
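A virtual machine of this kind fits in a short sketch. The code below is a simplified illustration and not taken from [5] or from Perl: the tuple encoding of instructions and the name run_vm are assumptions, instruction addresses are list indices (so match sits at index 3 rather than 4), and split lets the running thread continue at its first target as in Example 12. A program with a loop that consumes no input would not terminate.

```python
def run_vm(program, text):
    """Run a regular expression program; return the SP of every matching thread."""
    matches = []
    threads = [(0, 0)]                       # runnable threads as (PC, SP)
    while threads:
        pc, sp = threads.pop()
        while True:
            op = program[pc]
            if op[0] == "char":
                if sp < len(text) and text[sp] == op[1]:
                    pc, sp = pc + 1, sp + 1
                else:
                    break                    # thread abandoned
            elif op[0] == "jmp":
                pc = op[1]
            elif op[0] == "split":
                threads.append((op[2], sp))  # new thread starts at the second target
                pc = op[1]                   # this thread continues at the first
            elif op[0] == "match":
                matches.append(sp)
                break
    return matches


# a* compiled as in Example 12, with list indices as addresses.
A_STAR = [("split", 1, 3), ("char", "a"), ("jmp", 0), ("match",)]

if __name__ == "__main__":
    print(sorted(run_vm(A_STAR, "aa")))
```

Each returned value is the string pointer of one thread that reached match, corresponding to threads T2, T3 and T4 in the example. A match of the whole string is present when len(text) is among the returned values.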

We believe this is how Perl matches [11]. Running a program with debug mode on makes Perl print a textual representation of the byte-code the regular expression is compiled into. In section C on page 74 we have a small Perl experiment demonstrating this feature. The results of running the program:

$ ./regexmach.pl
Compiling REx "a*"
Final program:
   1: STAR (4)
   2: EXACT <a> (0)
   4: END (0)
minlen 0
Freeing REx: "a*"

9 Future work

In the following we explain in more detail the areas that could be interesting to pursue after this project. We describe some extensions to the regular expression feature-set, internationalization and a few extra filters, and have a few notes on concurrency.

9.1 Extending the current regular expression feature-set

Regular expressions in real world usage are more complicated than what we have described in this thesis. Here is a list of some of the features that the regular expressions presented here could be extended with.

Counted repetitions A shorthand for matching at least n, but no more than m times can easily be implemented. Using industry standard notation, the repetition e{3} expands to eee, e{2,5} expands to eee?e?e? and e{2,} expands to eee∗, where e is some regular expression.

Non-greedy or lazy quantifiers Traditionally the quantifiers will match as much as possible; they are greedy. Non-greedy quantifiers will match as little as possible.

Character class shorthands Often used character classes, like [A-Za-z0-9_] for a word character, often have shorthands. This makes for shorter and more readable regular expressions.

Unanchored matches In this thesis we have assumed that the regular expression will match the whole of the string. In practice, it is often useful to find out if the regular expression matches a substring of the input string. An unanchored match has an implicit non-greedy .* appended and prepended. Here it would also be very useful to have the ability to restart matching where the previous match left off.

More escape sequences Special escape sequences for characters like tab, newline, return, form feed, alarm and escape are useful. They are more readable in a regular expression than the actual characters themselves, because those will show as blank space in most editors.

Assertions Assertions do not consume characters from the input string. They assert properties about the surrounding text. Most languages and libraries provide the start and end of line and word boundary assertions. There are also general assertions, like the lookahead assertion (?=r), which asserts that the text after the current position matches r.


Case insensitive matching This could be implemented as the poor man's version of the case insensitive match: Every character a matched case insensitively is expanded to a character class [aA]. This might not be the best idea performance-wise, since character classes are so expensive to simulate.
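Two of the extensions above are plain rewritings of the regular expression and could be done in a preprocessing step. The sketch below shows counted repetitions and the poor man's case insensitive match; the function names are hypothetical, e is assumed to be a single unit (a character, a class or a parenthesized group), and the literal is assumed to consist of ASCII characters without regular expression metacharacters.

```python
def expand_counted(e: str, low: int, high=None) -> str:
    """Expand e{low,high} into concatenation, ? and *.

    high=None means that there is no upper bound, as in e{low,}.
    """
    if high is None:
        return e * low + e + "*"
    if high < low:
        raise ValueError("high must not be smaller than low")
    return e * low + (e + "?") * (high - low)


def expand_case_insensitive(literal: str) -> str:
    """Replace every letter of a literal string by a two-character class."""
    parts = []
    for ch in literal:
        if ch.lower() != ch.upper():
            parts.append("[" + ch.lower() + ch.upper() + "]")
        else:
            parts.append(ch)                 # digits etc. have no case
    return "".join(parts)


if __name__ == "__main__":
    print(expand_counted("e", 2, 5))
    print(expand_case_insensitive("ab1"))
```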

9.2 Internationalization

What this long word covers is basically the integration of character sets other than ASCII. Internationalization also makes character class shorthands mean different things. The word character class from above would vary according to locale; for example, in a Danish setting it would make more sense to define it as [A-Åa-å0-9_].

9.3 More and better filters

Copy A copy filter can easily be implemented: just write the input to two output channels. These channels could be standard output and standard error. A very similar effect can already be achieved with the utility tee, which reads from standard input and writes to standard output and files.

Serialize The serialization filter dumps the contents of the captured groups with no formatting. It would be beneficial to the user to have some kind of formatting to distinguish the different captured groups.

Applying regular expressions line by line It would not be enough to just insert an end-of-stream character after every new-line. We would probably have to rethink the protocol as well. We would need some way of signaling to the programs downstream that this is a new match, but that the same regular expression should be applied.

9.4 Concurrency

We can split up neither the regular expression nor the input string for concurrent processing. To split up the regular expression, we would need to know exactly how much of the input string each of the sub-regular expressions consumes. This we cannot know unless we actually do the match. Splitting up the string instead would also require knowledge we can only gain by actually performing the match: we would need to know in what state the simulation process should start at this particular input string symbol.

The current setup is a pipeline. Only the streaming filters are able to process data upon reception; the non-streaming filters gather all data in the input stream before beginning the processing. On most Unix-like systems it is possible to chain programs together with pipes. The output of a program is fed directly to the input of the next program in the chain. This is usually implemented so that all the programs in the chain are started at the same time. The scheduler is then responsible for managing the processes. Although the problem is not parallel in nature, by dividing the solution into discrete components we can at least utilize each streaming component simultaneously.

To gain more control over the concurrency we could implement the programs as processes. This would allow us to tweak the scheduling even further. It is however doubtful that this will give a performance boost based on better control over the concurrency, as the scheduler already does a good job of this. The performance boost will more likely come from faster communication channels.

In a Unix environment there will be no loss of data in a pipeline if, for example, a program produces data faster than the receiving program can read it. The data will be buffered until the receiving program is ready to read the data. If the buffer fills up, the producer will be suspended until there is room in the buffer again. This could mean we are buffering data twice, once in the buffer used by the operating system between programs in a pipeline and once in the buffer used by the program's own input and output.
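The difference between the two kinds of filters can be illustrated with generators. This is only a model of the data flow under our own assumptions: the stages below do placeholder work (inverting and reversing bits), their names are hypothetical, and Python generators run in a single thread, so nothing here is executed in parallel as separate processes in a shell pipeline would be.

```python
from typing import Iterable, Iterator


def streaming_filter(bits: Iterable[str]) -> Iterator[str]:
    """Streaming stage: emits output for each item as soon as it arrives."""
    for bit in bits:
        yield "1" if bit == "0" else "0"    # placeholder work: invert the bit


def non_streaming_filter(bits: Iterable[str]) -> Iterator[str]:
    """Non-streaming stage: must see the whole input before emitting anything."""
    buffered = list(bits)                    # memory grows with the input
    yield from reversed(buffered)


def pipeline(source: Iterable[str]) -> str:
    # Stages are chained like programs joined by pipes in a shell.
    return "".join(non_streaming_filter(streaming_filter(source)))


if __name__ == "__main__":
    print(pipeline("001"))
```

The streaming stage needs constant memory regardless of the input length, while the non-streaming stage has to hold everything it has received, which mirrors the role of trace in our pipeline.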

10 Conclusion

In this thesis we have designed a regular expression engine, demonstrated a prototype of our design and compared it to existing implementations of regular expression engines. We have explained the reasoning behind our design and reasoned about theoretical performance, both in terms of run-time and storage requirements. We then implemented the design, and demonstrated its viability in practice, and that our expected asymptotic bounds hold.

We have discussed a weak point in the design regarding the size of the mixed bit-values output, which in turn determines the run-time and in one case the memory consumption of the filters. We have only observed a linear relationship between the size of the input string and the size of the mixed bit-values. A worst-case analysis tells us that the size could be as big as the product of the size of the input string and the size of the regular expression. Practically speaking, however, most simulations will not fall into this category.

A drawback for ordinary users is the need to rewrite the regular expression for some of the filters. This has to be done by the user according to a rewriting function. It is a fairly straightforward procedure, but it is a big hindrance to the design presented here being used in a more mainstream setting.

In addition to this, we have also discussed regular expressions and finite automata, and conducted an analysis of Dubé and Feeley's algorithm to adapt it to our purposes, namely to communicate progress between filters.

Finally, we would like to note that while this project obviously only constitutes a prototype, the concept has some interesting potential applications. Creative use of the 'trace' and 'serialize' filters can, in some cases, be used for compression, to name one example. The idea is that, if you have a regular expression matching a string, the resulting bit-values will take up less space than the string itself.

References

[1] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: principles, techniques, and tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1986.

[2] Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition, 2001.

[3] Russ Cox. Regular expression implementation, 2007. http://swtch.com/~rsc/regexp/nfa.c.txt.

[4] Russ Cox. Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...), 2007. http://swtch.com/~rsc/regexp/regexp1.html.

[5] Russ Cox. Regular Expression Matching: the Virtual Machine Approach, 2009. http://swtch.com/~rsc/regexp/regexp2.html.

[6] Danny Dubé and Marc Feeley. Efficiently building a parse tree from a regular expression. Acta Informatica, 37(2):121–144, September 2000.

[7] Alain Frisch and Luca Cardelli. Greedy regular expression matching. In Josep Díaz, Juhani Karhumäki, Arto Lepistö, and Donald Sannella, editors, Automata, Languages and Programming: 31st International Colloquium, ICALP 2004, Turku, Finland, July 12-16, 2004. Proceedings, volume 3142 of Lecture Notes in Computer Science, pages 618–629. Springer, 2004.

[8] Fritz Henglein and Lasse Nielsen. Declarative coinductive axiomatization of regular expression containment and its computational interpretation (preliminary version). 2010.

[9] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to automata theory, languages, and computation. Addison-Wesley, 2nd edition, 2001.

[10] Ville Laurikari. Efficient submatch addressing for regular expressions. 2001.

[11] Yves Orton. perlreguts, 2006. http://perldoc.perl.org/perlreguts.html.

[12] Line Bie Pedersen. Regular expression libraries, tools and applications. 2010.


[13] Ken Thompson. Regular Expression Search Algorithm. Commun. ACM, 11(6):419–422, 1968.

[14] Margus Veanes, Peli de Halleux, and Nikolai Tillmann. Rex: Symbolic regular expression explorer. Software Testing, Verification, and Validation, 2008 International Conference on, 0:498–507, 2010.

[15] Larry Wall. Apocalypse 5: Pattern Matching, 2002. http://dev.perl.org/perl6/doc/design/apo/A05.html.

[16] Jan Wiberg. Grid replicated storage for the minimum intrusion grid. 2010.


D Optimization scripts

A Test computer specifications

In this section we list the relevant parts of the specifications of the computer that was used for testing and benchmarking.

Software versions

Operating system Ubuntu 10.10 - the Maverick Meerkat

gcc (Ubuntu/Linaro 4.4.4-14ubuntu5) 4.4.5

perl 5.10.1 (*) built for i686-linux-gnu-thread-multi

tcl 8.4.16-2

re2 Version present in the repository at the date of fetching: 2 March 2011.

g++ (Ubuntu/Linaro 4.4.4-14ubuntu5) 4.4.5

Technical specifications

CPU Intel(R) Core(TM) i3 CPU M 330 @ 2.13GHz

Memory 1938 MB

Storage Samsung HM250HI

B Huffman trees

C Experiments

Perl's debug output

regexmach.pl

#! /usr/bin/perl -Wall

use strict;
use re Debug => 'DUMP';

"aaa" =~ /a*/;


re2match.cc

Figure 37: Huffman tree for frequencies in table 4, .*


Figure 38: Huffman tree for frequencies in table 4, (?:(?:(?:[a-zA-Z]+ ?)+[,.;:]?)*..)*


#include <re2/re2.h>
#include <stdio.h>
#include <string>
#include <iostream>

using namespace re2;

/* Usage: re2match regex text. Prints the first captured group on a full match. */
int
main(int argc, char *argv[])
{
    std::string s;

    if(RE2::FullMatch(argv[2], argv[1], &s)) {
        std::cout << s;
        return 0;
    }
    return 1;
}

re2matchonly.cc

#include <re2/re2.h>
#include <stdio.h>
#include <string>
#include <iostream>

using namespace re2;

/* Usage: re2matchonly regex text. Prints "t" on a full match, "b" otherwise. */
int
main(int argc, char *argv[])
{
    std::string s;

    if(RE2::FullMatch(argv[2], argv[1])) {
        std::cout << "t";
    }
    else {
        std::cout << "b";
    }
    return 0;
}

perlmatch.pl

#! /usr/bin/perl

if($ARGV[1] =~ $ARGV[0]){
    print "$1";
}

perlmatchonly.pl

#! /usr/bin/perl

if($ARGV[1] =~ $ARGV[0]){
    print "t";
} else {
    print "b";
}

tclmatch.tcl

#! /usr/bin/tclsh

if {[regexp [lindex $argv 0] [lindex $argv 1] result]} {
    puts $result
}

tclmatchonly.tcl

#! /usr/bin/tclsh

if {[regexp [lindex $argv 0] [lindex $argv 1]]} {
    puts "t"
} else {
    puts "b"
}

Memory usage

memoryusage.pl


use strict;

my $regex = '(.*)';
my $regex2 = '|(.*)';

# Generate the files
`cat lipsum.txt | ./main '$regex' > ~/speciale/memory.main.mbv`;
`./groupings_all '$regex' < ~/speciale/memory.main.mbv > ~/speciale/memory.groupings_all.mbv`;
`./trace < ~/speciale/memory.groupings_all.mbv > ~/speciale/memory.trace.mbv`;
`./serialize '$regex2' < ~/speciale/memory.trace.mbv > ~/speciale/memory.serialize.mbv`;

# Collect memory usage data using massif
`cat lipsum.txt | valgrind --tool=massif --stacks=yes ./main '$regex'`;
`valgrind --tool=massif --stacks=yes ./ismatch < ~/speciale/memory.main.mbv`;
`valgrind --tool=massif --stacks=yes ./groupings_all '$regex' < ~/speciale/memory.main.mbv`;
`valgrind --tool=massif --stacks=yes ./trace2 < ~/speciale/memory.groupings_all.mbv`;
`valgrind --tool=massif --stacks=yes ./trace < ~/speciale/memory.groupings_all.mbv`;
`valgrind --tool=massif --stacks=yes ./serialize '$regex2' < ~/speciale/memory.trace.mbv`;

Runtimes

runtimes.pl


use strict;
use Time::HiRes 'time';

my $regex = "(.*)";
my $regex2 = "|(.*)";
my $startTime, my $endTime, my $result;

# Generate the files
`cat ~/speciale/shorttrace | ./main '$regex' > ~/speciale/runtime.main.mbv`;
`./groupings_all '$regex' < ~/speciale/runtime.main.mbv > ~/speciale/runtime.groupings_all.mbv`;
`./trace < ~/speciale/runtime.groupings_all.mbv > ~/speciale/runtime.trace.mbv`;
`./serialize '$regex2' < ~/speciale/runtime.trace.mbv > ~/speciale/runtime.serialize.mbv`;

my $runs = 5;
my $i;
my $best;

for ($i = 0; $i < $runs; $i++){
    print $i;
    $startTime = time();
    `cat ~/speciale/shorttrace | ./main '$regex'`;
    $endTime = time();

    # Keep the fastest of the runs
    if($i == 0 || ($endTime - $startTime) < $best){
        $best = $endTime - $startTime;
    }
}

printf("Main: takes %.3f seconds.\n", $best);

for ($i = 0; $i < $runs; $i++){print $i;$startTime = time();‘./groupings_all ’$regex’ < ˜/speciale/runtime.main.mbv‘;$endTime = time();


}}

printf("groupings_all: takes %.3f seconds.\n", $best);

for ($i = 0; $i < $runs; $i++){print $i;$startTime = time();‘./ismatch < ˜/speciale/runtime.main.mbv‘;$endTime = time();


}}

printf("ismatch: takes %.3f seconds.\n", $best);

for ($i = 0; $i < $runs; $i++){print $i;$startTime = time();‘./trace < ˜/speciale/runtime.groupings_all.mbv‘;$endTime = time();


}}

printf("trace: takes %.3f seconds.\n", $best);

$startTime = time();‘./trace2 < ˜/speciale/runtime.groupings_all.mbv‘;


$endTime = time();

$best = $endTime - $startTime;

printf("trace2: takes %.3f seconds.\n", $best);

for ($i = 0; $i < $runs; $i++){print $i;$startTime = time();‘./serialize ’$regex2’ < ˜/speciale/runtime.trace.mbv‘;$endTime = time();


}}

printf("serialize: takes %.3f seconds.\n", $best);

Profiling

profiling.pl

#! /usr/bin/perl -W

use strict;

my $regex = ’(.*)’;my $regex2 = ’|(.*)’;

‘cat ˜/speciale/xac | ./main ’(.*)’‘;‘mv gmon.out gmon.main.out‘;‘cat ˜/speciale/xac | ./main ’(.*)’ | ./groupings_all ’(.*)’

‘;‘mv gmon.out gmon.groupings_all.out‘;‘cat ˜/speciale/shorttrace | ./main ’(.*)’ | ./groupings_all

’(.*)’ | ./trace2‘;‘mv gmon.out gmon.trace.out‘;‘cat ˜/speciale/xac | ./main ’(.*)’ | ./groupings_all ’(.*)’

| ./trace | ./serialize ’|(.*)’‘;‘mv gmon.out gmon.serialize.out‘;

E Benchmark scripts

backtrackingworstcase.pl



use POSIX;use strict;use Time::HiRes ’time’;

my $startTime, my $endTime, my $result;

my $runs = 10;my $i, my $j;my $best;my $n;

print "main";print " n sec";for($i = 1; $i <= 10; $i++){

$n = $i * 500;my $regex = "a?" x $n . "a" x $n;my $text = "a" x $n;

for ($j = 0; $j < $runs; $j++){$startTime = time();$result = ‘echo -n ’$text’ | ./main ’$regex’ | ./

ismatch‘;$endTime = time();

if($j == 0 || ($endTime - $startTime) < $best){$best = $endTime - $startTime;

}}chomp $result;print "ARGH" unless $result eq ’t’;printf("%4i %.3f\n", $n, $best);

}

print "perl";print " n sec";for($i = 1; $i <= 10; $i++){

$n = ceil($i * 2.5);my $regex = "a?" x $n . "a" x $n;my $text = "a" x $n;

for ($j = 0; $j < $runs; $j++){$startTime = time();$result = ‘./perlmatchonly.pl ’$regex’ ’$text’‘;$endTime = time();


}


}chomp $result;print "ARGH" unless $result eq ’t’;printf("%4i %.3f\n", $n, $best);

}

print "re2";print " n sec";for($i = 1; $i <= 10; $i++){


for ($j = 0; $j < $runs; $j++){$startTime = time();$result = ‘./re2matchonly ’$regex’ ’$text’‘;$endTime = time();



}

print "tcl";print " n sec";for($i = 1; $i <= 10; $i++){


for ($j = 0; $j < $runs; $j++){$startTime = time();$result = ‘./tclmatchonly.tcl ’$regex’ ’$text’‘;$endTime = time();



}


backtrackingworstcase mem.pl




my $i, my $j;my $best;my $n;

for($i = 1; $i <= 10; $i++){$n = $i * 500;my $regex = "a?" x $n . "a" x $n;my $text = "a" x $n;

‘echo -n ’$text’ | valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’backtrackingworstcase/main.$n.out’ ./main ’$regex’ | valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’backtrackingworstcase/ismatch.$n.out’ ./ismatch‘;

}

for($i = 1; $i <= 10; $i++){$n = ceil($i * 2.5);my $regex = "a?" x $n . "a" x $n;my $text = "a" x $n;

‘valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’backtrackingworstcase/perl.$n.out’ ./perlmatchonly.pl ’$regex’ ’$text’‘;

}


‘valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’backtrackingworstcase/re2.$n.out’ ./re2matchonly’$regex’ ’$text’‘;

}



‘valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’backtrackingworstcase/tcl.$n.out’ ./tclmatchonly.tcl ’$regex’ ’$text’‘;

}

dfaworstcase.pl




my $runs = 10;my $i, my $j;my $best;my $n;

print "main";print " n sec";for($i = 1; $i <= 10; $i++){

$n = $i * 500;my $regex = "(a|b)*a" . "(a|b)" x $n;my $text = "a" . "a" x $n;

for ($j = 0; $j < $runs; $j++){$startTime = time();$result = ‘echo -n ’$text’ | ./main ’$regex’ | ./

ismatch‘;$endTime = time();



}

print "perl";print " n sec";


for($i = 1; $i <= 10; $i++){$n = $i * 500;my $regex = "(a|b)*a" . "(a|b)" x $n;my $text = "a" . "a" x $n;

for ($j = 0; $j < $runs; $j++){$startTime = time();$result = ‘./perlmatchonly.pl ’$regex’ ’$text’‘;$endTime = time();



}

print "re2";print " n sec";for($i = 1; $i <= 10; $i++){


for ($j = 0; $j < $runs; $j++){$startTime = time();$result = ‘./re2matchonly ’$regex’ ’$text’‘;$endTime = time();



}

print "tcl";print " n sec";for($i = 1; $i <= 10; $i++){


for ($j = 0; $j < $runs; $j++){$startTime = time();


$result = ‘./tclmatchonly.tcl ’$regex’ ’$text’‘;$endTime = time();



}

dfaworstcase mem.pl



my $result;

my $i, my $j;my $n;


$result = ‘echo -n ’$text’ | valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’dfaworstcase/main.$n.out’ ./main ’$regex’ | valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’dfaworstcase/ismatch.$n.out’ ./ismatch‘;chomp $result;print "ARGH" unless $result eq ’t’;

}


$result = ‘valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’dfaworstcase/perl.$n.out’ ./perlmatchonly.pl ’$regex’ ’$text’‘;

chomp $result;


print "ARGH" unless $result eq ’t’;}


$result = ‘valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’dfaworstcase/re2.$n.out’ ./re2matchonly’$regex’ ’$text’‘;

chomp $result;print "ARGH" unless $result eq ’t’;

}


$result = ‘valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’dfaworstcase/tcl.$n.out’ ./tclmatchonly.tcl ’$regex’ ’$text’‘;

chomp $result;print "ARGH" unless $result eq ’t’;

}

email.pl


use strict;use Time::HiRes ’time’;use String::Random;

my $regex = ’([A-Za-z0-9](?:(?:[_.-]?[a-zA-Z0-9]+)*)@(?:[A-Za-z0-9]+)(?:(?:[.-]?[a-zA-Z0-9]+)*)\.(?:[A-Za-z][A-Za-z]+))’;

my $regex2 = ’|([A-Za-z0-9](?:(?:[_.-]?[a-zA-Z0-9]+)*)@(?:[A-Za-z0-9]+)(?:(?:[.-]?[a-zA-Z0-9]+)*)\.(?:[A-Za-z][A-Za-z]+))’;


my $foo = new String::Random;$foo->{’A’} = [ ’A’..’Z’, ’a’..’z’, ’0’..’9’ ];$foo->{’B’} = [ ’_’, ’.’, ’-’ ];$foo->{’D’} = [ ’.’, ’-’ ];


$foo->{’E’} = [ ’A’..’Z’, ’a’..’z’ ];$foo->{’F’} = [ ’@’ ];$foo->{’G’} = [ ’.’ ];

my $n;

#my $email = $foo->randpattern("A" . ("B" x int(rand(2)) . "A" x int(rand($n))) x int(rand($n)) . "FA" . ("D" x int(rand(2)) . "A" x int(rand($n))) x int(rand($n)) . "GEE". "E" x int(rand($n)));

#my $email = $foo->randpattern("A" . ("B" . ("A" x $n)) x $n. "FA" . "D" . (("A" x $n) x $n) . "GEE" . ("E" x $n));

my @email;my $runs = 10;my $i, my $j;my $best;my $factor = 25;

for($i = 1; $i <= 10; $i++){$n = $i * $factor;$email[$i] = $foo->randpattern("A" . ("B" . ("A" x $n))x $n . "FA" . "D" . (("A" x $n) x $n) . "GEE" . ("E" x$n));

}

print "main";print " len sec";for($j = 1; $j <= 10; $j++){

for($i = 0; $i < $runs; $i++){$startTime = time();$result = ‘echo -n ’$email[$j]’ | ./main ’$regex’ |

./groupings_all ’$regex’ | ./trace | ./serialize ’$regex2’ ‘;

$endTime = time();


}}print "ARGH" unless $result eq $email[$j];printf("%7i %.3f\n", length($email[$j]), $best);

}

print "tcl";print " len sec";for($j = 1; $j <= 10; $j++){

for($i = 0; $i < $runs; $i++){


$startTime = time();$result = ‘./tclmatch.tcl ’$regex’ ’$email[$j]’‘;$endTime = time();


}}chomp($result);print "ARGH" unless $result eq $email[$j];printf("%7i %.3f\n", length($email[$j]), $best);

}

print "re2";print " len sec";for($j = 1; $j <= 10; $j++){

for($i = 0; $i < $runs; $i++){$startTime = time();$result = ‘./re2match ’$regex’ ’$email[$j]’‘;$endTime = time();



}

print "perl";print " len sec";for($j = 1; $j <= 10; $j++){

for($i = 0; $i < $runs; $i++){$startTime = time();$result = ‘./perlmatch.pl ’$regex’ ’$email[$j]’‘;$endTime = time();



}

email mem.pl






my $result;

my $foo = new String::Random;$foo->{’A’} = [ ’A’..’Z’, ’a’..’z’, ’0’..’9’ ];$foo->{’B’} = [ ’_’, ’.’, ’-’ ];$foo->{’D’} = [ ’.’, ’-’ ];$foo->{’E’} = [ ’A’..’Z’, ’a’..’z’ ];$foo->{’F’} = [ ’@’ ];$foo->{’G’} = [ ’.’ ];

my $n;

#my $email = $foo->randpattern("A" . ("B" x int(rand(2)) . "A" x int(rand($n))) x int(rand($n)) . "FA" . ("D" x int(rand(2)) . "A" x int(rand($n))) x int(rand($n)) . "GEE". "E" x int(rand($n)));

#my $email = $foo->randpattern("A" . ("B" . ("A" x $n)) x $n. "FA" . "D" . (("A" x $n) x $n) . "GEE" . ("E" x $n));

my @email;my $i, my $j;my $factor = 25;


}

for($j = 1; $j <= 10; $j++){my $n = length($email[$j]);$result = ‘echo -n ’$email[$j]’ | valgrind --tool=massif--pages-as-heap=yes --massif-out-file=’email/main.$n.

out’ ./main ’$regex’ | valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’email/groupings_all.$n.out’./groupings_all ’$regex’ | valgrind --tool=massif --

pages-as-heap=yes --massif-out-file=’email/trace.$n.out’


./trace | valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’email/serialize.$n.out’ ./serialize ’$regex2’ ‘;print "ARGH" unless $result eq $email[$j];

}

for($j = 1; $j <= 10; $j++){my $n = length($email[$j]);$result = ‘valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’email/tcl.$n.out’ ./tclmatch.tcl ’$regex’ ’$email[$j]’‘;chomp($result);print "ARGH" unless $result eq $email[$j];

}

for($j = 1; $j <= 10; $j++){my $n = length($email[$j]);$result = ‘valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’email/re2.$n.out’ ./re2match ’$regex’ ’$email[$j]’‘;print "ARGH" unless $result eq $email[$j];

}

for($j = 1; $j <= 10; $j++){my $n = length($email[$j]);$result = ‘valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’email/perl.$n.out’ ./perlmatch.pl ’$regex’ ’$email[$j]’‘;print "ARGH" unless $result eq $email[$j];

}

email mbvsize.pl







my $foo = new String::Random;$foo->{’A’} = [ ’A’..’Z’, ’a’..’z’, ’0’..’9’ ];$foo->{’B’} = [ ’_’, ’.’, ’-’ ];$foo->{’D’} = [ ’.’, ’-’ ];$foo->{’E’} = [ ’A’..’Z’, ’a’..’z’ ];$foo->{’F’} = [ ’@’ ];$foo->{’G’} = [ ’.’ ];

my $n;

my @email;my $runs = 1;my $i, my $j;my $best;my $factor = 25;


}

print " main groupings_all traceserialize";

for($j = 1; $j <= 10; $j++){my $n = length($email[$j]);‘echo -n ’$email[$j]’ | ./main ’$regex’ > main.mbv.tmp‘;‘./groupings_all ’$regex’ < main.mbv.tmp > groupings_all.mbv.tmp‘;‘./trace < groupings_all.mbv.tmp > trace.mbv.tmp‘;‘./serialize ’$regex2’ < trace.mbv.tmp > serialize.mbv.tmp‘;

printf("%13i %13i %13i %13i\n", -s ’main.mbv.tmp’,-s ’groupings_all.mbv.tmp’, -s ’trace.mbv.tmp’,-s ’serialize.mbv.tmp’);

}

number.pl




my $regex = ’([+-]?(?:[0-9]*\.?[0-9]+|[0-9]+\.?[0-9]*)(?:[eE][+-]?[0-9]+)?)’;

my $regex2 = ’|([+-]?(?:[0-9]*\.?[0-9]+|[0-9]+\.?[0-9]*)(?:[eE][+-]?[0-9]+)?)’;


my $n;

my @number;my $runs = 1;my $i, my $j;my $best;my $factor = 1000;

for($j = 1; $j <= 10; $j++){$n = $factor * $j;$number[$j] = ’+’ . int(rand($n)) x $n . ’.’ . int(rand($n)) x $n . ’E-’ . int(rand($n)) x $n;

}

print "main";print " len sec";for($j = 1; $j <= 10; $j++){

for($i = 0; $i < $runs; $i++){$startTime = time();$result = ‘./main ’$regex’ ’$number[$j]’ | ./

groupings_all ’$regex’ | ./trace | ./serialize ’$regex2’‘;

$endTime = time();if($i == 0 || ($endTime - $startTime) < $best){

$best = $endTime - $startTime;}

}print "ARGH" unless $result eq $number[$j];printf("%7i %.3f\n", length($number[$j]), $best);

}

print "tcl";print " len sec";for($j = 1; $j <= 10; $j++){

for($i = 0; $i < $runs; $i++){$startTime = time();$result = ‘./tclmatch.tcl ’$regex’ ’$number[$j]’‘;$endTime = time();


}


}chomp($result);print "ARGH" unless $result eq $number[$j];printf("%7i %.3f\n", length($number[$j]), $best);

}

print "re2";print " len sec";for($j = 1; $j <= 10; $j++){

for($i = 0; $i < $runs; $i++){$startTime = time();$result = ‘./re2match ’$regex’ ’$number[$j]’‘;$endTime = time();


}}print "ARGH" unless $result eq $number[$j];printf("%7i %.3f\n", length($number[$j]), $best);

}

print "perl";print " len sec";for($j = 1; $j <= 10; $j++){

for($i = 0; $i < $runs; $i++){$startTime = time();$result = ‘./perlmatch.pl ’$regex’ ’$number[$j]’‘;$endTime = time();


}}print "ARGH" unless $result eq $number[$j];printf("%7i %.3f\n", length($number[$j]), $best);

}

number mem.pl



my $regex = ’([+-]?(?:[0-9]*\.?[0-9]+|[0-9]+\.?[0-9]*)(?:[eE][+-]?[0-9]+)?)’;


my $regex2 = ’|([+-]?(?:[0-9]*\.?[0-9]+|[0-9]+\.?[0-9]*)(?:[eE][+-]?[0-9]+)?)’;


my $n;

my @number;my $i, my $j;my $factor = 1000;


}

for($j = 1; $j <= 10; $j++){my $n = length($number[$j]);‘valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’number/main.$n.out’ ./main ’$regex’ ’$number[$j]’| valgrind --tool=massif --pages-as-heap=yes --massif-

out-file=’number/groupings_all.$n.out’ ./groupings_all ’$regex’ | valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’number/trace.$n.out’ ./trace | valgrind--tool=massif --pages-as-heap=yes --massif-out-file=’

number/serialize.$n.out’ ./serialize ’$regex2’ ‘;}

for($j = 1; $j <= 10; $j++){my $n = length($number[$j]);‘valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’number/tcl.$n.out’ ./tclmatch.tcl ’$regex’ ’$number[$j]’‘;

}

for($j = 1; $j <= 10; $j++){my $n = length($number[$j]);‘valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’number/re2.$n.out’ ./re2match ’$regex’ ’$number[$j]’‘;

}

for($j = 1; $j <= 10; $j++){my $n = length($number[$j]);‘valgrind --tool=massif --pages-as-heap=yes --massif-out-file=’number/perl.$n.out’ ./perlmatch.pl ’$regex’ ’


$number[$j]’‘;}

number mbvsize.pl



my $regex = ’([+-]?(?:[0-9]*\.?[0-9]+|[0-9]+\.?[0-9]*)(?:[eE][+-]?[0-9]+)?)’;

my $regex2 = ’|([+-]?(?:[0-9]*\.?[0-9]+|[0-9]+\.?[0-9]*)(?:[eE][+-]?[0-9]+)?)’;

my $n;

my @number;my $i, my $j;my $factor = 1000;


}

print " main groupings_all traceserialize";

for($j = 1; $j <= 10; $j++){‘echo -n ’$number[$j]’ | ./main ’$regex’ > main.mbv.tmp‘;‘./groupings_all ’$regex’ < main.mbv.tmp > groupings_all.mbv.tmp‘;‘./trace < groupings_all.mbv.tmp > trace.mbv.tmp‘;‘./serialize ’$regex2’ < trace.mbv.tmp > serialize.mbv.tmp‘;

printf("%13i %13i %13i %13i\n", -s ’main.mbv.tmp’,-s ’groupings_all.mbv.tmp’, -s ’trace.mbv.tmp’,-s ’serialize.mbv.tmp’);

}

F Source code

graphviz.c

#include <stdio.h>


#include <graphviz/gvc.h>#include "util.h"#include "nfa.h"

#define LABEL_BUF_SIZE 11

int print_counter;char label_buf[LABEL_BUF_SIZE];//char *red = "red";//char *black = "black";//char *green = "green";

void_print_range(struct Range *r, enum Boolean is_negated, char

*buf, int size){int i;

i = 0;if(i >= (size - 1))

goto end;buf[i++] = ’[’;

if(i >= (size - 1))goto end;

if(is_negated)buf[i++] = ’ˆ’;

while(r != NULL){if(r->lo == r->hi){


buf[i++] = r->lo;}else{


buf[i++] = r->lo;buf[i++] = ’-’;buf[i++] = r->hi;

}r = r->next;

}


buf[i++] = ’]’;end:


buf[i] = 0;}

void_print_nfa(struct State *s, graph_t *g, Agnode_t *prev, char

*label,char *color)

{Agnode_t *n2;Agedge_t *e;Agsym_t *a;char temp[100];

assert(s != NULL);

if(s->n != NULL) {e = agedge(g, prev, s->n);a = agedgeattr(g, "label", "");agxset(e, a->index, label);

a = agedgeattr(g, "color", "");agxset(e, a->index, color);

return;}

#if defined(END_SPLIT_MARKER) || defined(PAREN_MARKER) ||defined(END_REP_MARKER)

switch(s->type){case NFA_SPLIT:

sprintf(temp, "%i\n%i", print_counter, s->parencount);break;

case NFA_EPSILON:if(s->subtype == END_SPLIT)

sprintf(temp, "%i\n%i", print_counter, s->parencount);else

sprintf(temp, "%i", print_counter);break;

default:sprintf(temp, "%i", print_counter);

}#else

sprintf(temp, "%i", print_counter);#endif

s->id = print_counter;print_counter++;n2 = agnode(g, temp);a = agnodeattr(g, "shape", "");


agxset(n2, a->index, "circle");

s->n = n2;

e = agedge(g, prev, n2);a = agedgeattr(g, "label", "");agxset(e, a->index, label);

a = agedgeattr(g, "color", "");agxset(e, a->index, color);


_print_nfa(s->out0, g, n2, "ε_0", "green");_print_nfa(s->out1, g, n2, "ε_1", "red");break;

case NFA_ACCEPTING:a = agnodeattr(g, "shape", "");agxset(n2, a->index, "doublecircle");break;

case NFA_RANGE:_print_range(s->range, s->is_negated, label_buf,LABEL_BUF_SIZE);_print_nfa(s->out0, g, n2, label_buf, "black");break;

case NFA_EPSILON:#if defined(END_SPLIT_MARKER) || defined(PAREN_MARKER) ||

defined(END_REP_MARKER)switch(s->subtype){case LEFT_PAREN:

_print_nfa(s->out0, g, n2, "ε_(", "black");break;

case RIGHT_PAREN:_print_nfa(s->out0, g, n2, "ε_)", "black");break;

default:_print_nfa(s->out0, g, n2, "ε", "black");

}#else

_print_nfa(s->out0, g, n2, "ε", "black");#endif

break;default:

label_buf[0] = s->c;label_buf[1] = 0;_print_nfa(s->out0, g, n2, label_buf, "black");break;

}}


voidprint_nfa(char *filename, struct State *s){

GVC_t *gvc;graph_t *g;Agnode_t *n;Agsym_t *a;char temp[10];FILE *file;

if ((file = fopen(filename, "w")) == NULL) {printf("Could not open %s for writing\n", filename);return;

}

gvc = gvContext();

g = agopen("NFA",AGDIGRAPH);

agraphattr(g, "fontname", "Palatino");agraphattr(g, "fontsize", "11");agraphattr(g, "rankdir", "LR");agraphattr(g, "margin", "0");//agraphattr(g, "size", "5.4,8.2");

agnodeattr(g, "fontname", "Palatino");agnodeattr(g, "fontsize", "11");agnodeattr(g, "width", "0");agnodeattr(g, "height", "0");

agedgeattr(g, "fontname", "Palatino");agedgeattr(g, "fontsize", "11");

print_counter = 0;

sprintf(temp, "%i", print_counter);print_counter++;

n = agnode(g, temp);a = agnodeattr(g, "shape", "");agxset(n, a->index, "point");

_print_nfa(s, g, n, "", "black");

gvLayout(gvc, g, "dot");gvRender(gvc, g, "pdf", file);


gvFreeLayout(gvc, g);agclose(g);gvFreeContext(gvc);fclose(file);

}

groupings.c

#include <stdio.h>#include <string.h>#include <stdlib.h>#include <getopt.h>#include "nfa.h"#include "util.h"#include "match.h"#include "groupings.h"

struct Channel *channel(struct State *s){

struct Channel *new;

if ((new = (struct Channel *) malloc(sizeof(struct Channel))) == NULL ) {fprintf(stderr, "Error allocating memory for channel\n");exit(1);

}

new->s = s;new->id = channel_id++;new->next = NULL;new->prev = NULL;new->parendepth = 0;

#ifdef END_REP_MARKERnew->end_rep_marker = false;new->suspend_output = 0;

#endifreturn new;

}

struct Channel *channel_copy(struct Channel *old){


if ((new = (struct Channel *) malloc(sizeof(struct Channel))) == NULL ) {fprintf(stderr, "Error allocating memory for channel\n");


exit(1);}

new->s = old->s;new->id = channel_id++;new->next = old->next;new->prev = old;new->parendepth = old->parendepth;

#ifdef END_REP_MARKERnew->end_rep_marker = old->end_rep_marker;new->suspend_output = old->suspend_output;

#endifreturn new;

}

void
channel_free(struct Channel *ch)
{
    free(ch);
}

void
channel_remove(struct Channel *ch)
{
    struct Channel *prev, *next;

    next = ch->next;
    prev = ch->prev;

    /* Unlink ch from the doubly linked channel list */
    if(ch->next == NULL){
        if(ch->prev != NULL)
            ch->prev->next = NULL;
    }
    else if(ch->prev == NULL){
        if(ch->next != NULL)
            ch->next->prev = NULL;
    }
    else {
        ch->next->prev = ch->prev;
        ch->prev->next = ch->next;
    }

    if(ch == clist.first)
        clist.first = ch->next;

    channel_free(ch);
}


voidfollow_epsilon(struct Channel *cur, FILE *outstream){

unsigned int i;

assert(cur != NULL);

while(true){assert(cur->s != NULL);switch(cur->s->type){case NFA_EPSILON:

switch(cur->s->subtype){

case LEFT_PAREN:if(cur->parendepth == 0)

write_bit(’1’, outstream);#ifdef END_REP_MARKER

if(!cur->suspend_output)#endif

cur->parendepth++;break;

case RIGHT_PAREN:#ifdef END_REP_MARKER

if(!cur->suspend_output)#endif

cur->parendepth--;break;

case END_SPLIT:for(i = 0; i < cur->s->parencount; i++){

write_bit(’0’, outstream);}break;

#ifdef END_REP_MARKERcase END_REPEAT:

cur->end_rep_marker = true;#endif

}// Fallthrough

case NFA_LITERAL:cur->s = cur->s->out0;break;

default:


return;}

}}

voiddo_end(char c, FILE *outstream){struct Channel *tmp;

if(clist.cur == NULL){fprintf(stderr, "Error: Channel corruption (%i)\n",__LINE__);exit(0);

}

write_bit(c, outstream);tmp = clist.cur;clist.cur = clist.cur->next;channel_remove(tmp);

}

voiddo_split(FILE *outstream){

struct Channel *tmp;


}write_bit(’=’, outstream);

tmp = channel_copy(clist.cur);if(tmp->next != NULL)

tmp->next->prev = tmp;clist.cur->next = tmp;

}

voiddo_one(FILE *outstream){

unsigned int i;

if(clist.cur == NULL){


fprintf(stderr, "Error: Channel corruption (%i)\n",__LINE__);exit(0);

}assert(clist.cur->s->type == NFA_SPLIT);

#ifdef END_REP_MARKERif(clist.cur->end_rep_marker){

clist.cur->suspend_output++;clist.cur->end_rep_marker = false;

}#endif

// Make transition on out1-arrowfor(i = 0; i < clist.cur->s->parencount; i++){

write_bit(’0’, outstream);}if(clist.cur->parendepth)

write_bit(’1’, outstream);clist.cur->s = clist.cur->s->out1;follow_epsilon(clist.cur, outstream);

}

voiddo_zero(FILE *outstream){if(clist.cur == NULL){

fprintf(stderr, "Error: Channel corruption (%i)\n",__LINE__);exit(0);

}

assert(clist.cur->s->type == NFA_SPLIT);

#ifdef END_REP_MARKERif(clist.cur->end_rep_marker){

clist.cur->suspend_output--;clist.cur->end_rep_marker = false;

}#endif

if(clist.cur->parendepth)write_bit(’0’, outstream);

clist.cur->s = clist.cur->s->out0;follow_epsilon(clist.cur, outstream);

}


voiddo_escape(FILE *instream, FILE *outstream){

char c;


}

c = read_bit(instream);//printf("received char: %c %i\n", c, c);if(c == EOF){

fprintf(stderr, "Error: Bad escape (%i)\n", __LINE__);}

// Make transition on out-arrowassert(clist.cur->s->type == NFA_RANGE);clist.cur->s = clist.cur->s->out0;follow_epsilon(clist.cur, outstream);if(clist.cur->parendepth){

write_bit(’\\’, outstream);write_bit(c, outstream);

}}

voidread_mbv(struct NFA nfa, FILE *instream, FILE *outstream){

char c;int i = 0;enum Boolean channel_switch;channel_id = 1;

clist.first = channel(nfa.start);clist.cur = clist.first;follow_epsilon(clist.cur, outstream);

while ( (c = read_bit(instream)) != EOF ){i++;//printf("received char %i: %c %i\n", i, c, c);

switch(c){case ’|’:

channel_switch = true;write_bit(c, outstream);clist.cur = clist.first;


break;

case ’:’:if(clist.cur == NULL){

fprintf(stderr, "Error: Channel corruption (%i)\n",__LINE__);

exit(0);}write_bit(c, outstream);

if(channel_switch)clist.cur = clist.cur->next;

elsechannel_switch = true;

break;

case ’=’:channel_switch = true;do_split(outstream);break;

case ’t’:// FALLTRHOUGH

case ’b’:channel_switch = false;do_end(c, outstream);break;

case ’0’:channel_switch = true;do_zero(outstream);break;

case ’1’:channel_switch = true;do_one(outstream);break;

case ’\\’:channel_switch = true;do_escape(instream, outstream);break;

case ’*’:channel_switch = true;break;

default:


fprintf(stderr, "Error: Bad character: %c %i (%i)\n",c, c, __LINE__);

exit(0);}

}}

voiddisplay_usage(void){

puts("main - match regular expression with text generatingmixed bit-values");

puts("usage: main [regex] [text] [more options]");puts("OPTIONS:");puts("These are the long option names, any uniqueabbreviation is also accepted.");

puts("--regular-expression=regex");puts("\tThe regular expression.");

puts("--debug-file=file");puts("\tOptional, this is the file where debug output isdumped.");

puts("\tThe debug output consists of a graph of the NFA inpdf format.");

puts("--output-stream=file");puts("\tOptional, if present output will be written tofile. Default is stdout.");

puts("--input-stream=file");puts("\tOptional, if present output will be read from file. Default is stdin.");

puts("--regular-expression-file=file");puts("\tRead regular expression from file.");

puts("--help");puts("\tWill print this message");

exit(1);}

intmain(const int argc, char* const argv[]){

struct NFA nfa;int c, regexlen, option_index;


char *regex = NULL;char *debugfile = NULL;FILE *outstream = stdout;FILE *instream = stdin;char *outbuf;char *inbuf;

static struct option long_options[] = {{"help", no_argument, NULL, ’h’},{"regular-expression", required_argument, NULL, ’r’},{"regular-expression-file", required_argument, NULL, ’a’},{"debug-file", required_argument, NULL, ’d’},{"output-stream", required_argument, NULL, ’o’},{"input-stream", required_argument, NULL, ’i’},{0, 0, 0, 0}

};

while(true){c = getopt_long(argc, argv, "hr:a:d:o:i:", long_options,&option_index);if(c == -1)

break;

switch(c){case ’a’:

regexlen = read_file(optarg, &regex);break;

case ’h’:display_usage();break;

case ’r’:regex = optarg;regexlen = strlen(regex);break;

case ’d’:debugfile = optarg;break;

case ’o’:if((outstream = fopen(optarg, "w")) == NULL){


perror("Can not open file for writing\n");exit(1);

}break;

case ’i’:if((instream = fopen(optarg, "r")) == NULL){

perror("Can not open file for reading\n");exit(1);

}break;

default:break;

}}

if(optind < argc) {regex = argv[optind++];regexlen = strlen(regex);

}

outbuf = init_stream(BUFSIZE, outstream);inbuf = init_stream(BUFSIZE, instream);

nfa = re2nfa(regex, regexlen);if(debugfile != NULL)

print_nfa(debugfile, nfa.start);read_mbv(nfa, instream, outstream);nfa_free(nfa.start);

close_stream(outbuf, outstream);close_stream(inbuf, instream);return 0;

}

ismatch.c

#include <stdio.h>#include <string.h>#include <stdlib.h>#include <getopt.h>#include "util.h"


puts("ismatch - filter to determine if a match hasoccurred");


puts("usage: ismatch [options]");puts("OPTIONS:");puts("These are the long option names, any uniqueabbreviation is also accepted.");


puts("--input-stream=file");puts("\tOptional, if present input will be read from file.

Default is stdin.");puts("--help");puts("\tWill print this message");

exit(1);}

voidread_mbv(FILE *instream, FILE *outstream){char c;enum Boolean is_escaped = false;enum Boolean empty = true;

while ( (c = read_bit(instream)) != EOF ){empty = false;if(is_escaped == true){

is_escaped = false;continue;

}

switch(c){case ’t’:

write_bit(’t’, outstream);return;

case ’\\’:is_escaped = true;break;

default:is_escaped = false;break;

}}

if(!empty)write_bit(’b’, outstream);

elsefprintf(stderr, "No output to read\n");


}

intmain(const int argc, char *const argv[]){

int c, option_index;FILE *outstream = stdout;FILE *instream = stdin;//int i;

static struct option long_options[] = {{"help", no_argument, NULL, ’h’},{"output-stream", required_argument, NULL, ’o’},{"input-stream", required_argument, NULL, ’i’},{0, 0, 0, 0}

};

while(true){c = getopt_long(argc, argv, "hi:o:", long_options, &option_index);if(c == -1)

break;

switch(c){case ’o’:

if((outstream = fopen(optarg, "w")) == NULL){perror("Can not open file for writing\n");exit(1);

}break;



}break;

default:break;

}}

//for(i = 0; i < RUNS; i++){read_mbv(instream, outstream);//printf("%c", EOF);//}fclose(instream);


fclose(outstream);}

main.c

#include <stdio.h>
#include <string.h>
#include <getopt.h>
#include "util.h"
#include "nfa.h"

void
display_usage(void)
{
  puts("main - match regular expression with text generating mixed bit-values");
  puts("usage: main [regex] [text] [more options]");
  puts("OPTIONS:");
  puts("These are the long option names, any unique abbreviation is also accepted.");

  puts("--regular-expression=regex");
  puts("\tThe regular expression.");
  puts("--text=text");
  puts("\tThe text to be matched.");
  puts("--debug-file=file");
  puts("\tOptional, this is the file where debug output is dumped.");

  puts("--regular-expression-file=file");
  puts("\tRead regular expression from file.");
  puts("--text-file=file");
  puts("\tRead text to be matched from file.");
  puts("--help");
  puts("\tWill print this message");

  exit(1);
}


int
main(const int argc, char *const argv[])
{
  int c, regexlen, textlen, option_index;
  struct NFA nfa;
  char *regex = NULL;
  char *text = NULL;
  char *debugfile = NULL;
  FILE *outstream = stdout;
  FILE *instream = stdin;
  char *outbuf;
  char *inbuf;

  static struct option long_options[] = {
    {"help", no_argument, NULL, 'h'},
    {"regular-expression", required_argument, NULL, 'r'},
    {"regular-expression-file", required_argument, NULL, 'a'},
    {"text-file", required_argument, NULL, 'b'},
    {"debug-file", required_argument, NULL, 'd'},
    {"output-stream", required_argument, NULL, 'o'},
    {0, 0, 0, 0}
  };

  while(true){
    c = getopt_long(argc, argv, "a:b:hr:t:d:o:", long_options, &option_index);
    if(c == -1)
      break;



    case 'b':
      if((instream = fopen(optarg, "r")) == NULL){

      }
      break;

    case 't':
      text = optarg;
      textlen = strlen(text);
      break;

      }
      break;
    default:
      break;
    }
  }

  }

  if(regex == NULL)
    display_usage();

  print_nfa(debugfile, nfa.start);
  match(nfa, instream, outstream);
  nfa_free(nfa.start);

}

Makefile

SRC = nfa.c util.c parse.c graphviz.c groupings.c \
      ismatch.c main.c trace.c serialize.c
HDR = match.h nfa.h util.h
OBJ = $(SRC:.c=.o)
ESM-PM-OBJ = $(SRC:.c=-esm-pm.o)
ESM-PM-ERM-OBJ = $(SRC:.c=-esm-pm-erm.o)
CC = gcc

CFLAGS = -lgvc -O3 -march=i686 -DBUFSIZE=1024
#CFLAGS = -pg -lgvc -g -Wall -Wextra -DBUFSIZE=1024
#CFLAGS = -lgvc -g -Wall -Wextra -DBUFSIZE=1024

all: groupings_all groupings_single ismatch main trace serialize trace2

trace: trace.o util.o
	$(CC) $(CFLAGS) util.o trace.o -o trace

trace2: trace2.o util.o
	$(CC) $(CFLAGS) util.o trace2.o -o trace2

serialize : util.o parse.o graphviz.o match.o serialize.o nfa.o
	$(CC) $(CFLAGS) nfa.o util.o parse.o graphviz.o match.o serialize.o -o serialize

groupings_all: nfa-esm-pm.o util-esm-pm.o parse-esm-pm.o graphviz-esm-pm.o groupings-esm-pm.o
	$(CC) $(CFLAGS) nfa-esm-pm.o util-esm-pm.o parse-esm-pm.o graphviz-esm-pm.o groupings-esm-pm.o -o groupings_all

groupings_single: nfa-esm-pm-erm.o util-esm-pm-erm.o parse-esm-pm-erm.o graphviz-esm-pm-erm.o groupings-esm-pm-erm.o
	$(CC) $(CFLAGS) nfa-esm-pm-erm.o util-esm-pm-erm.o parse-esm-pm-erm.o graphviz-esm-pm-erm.o groupings-esm-pm-erm.o -o groupings_single

ismatch: ismatch.o util.o
	$(CC) $(CFLAGS) ismatch.o util.o -o ismatch

main : util.o parse.o graphviz.o match.o main.o nfa.o
	$(CC) $(CFLAGS) nfa.o util.o parse.o graphviz.o match.o main.o -o main

$(ESM-PM-OBJ) : $(SRC) $(HDR)
	$(CC) $(CFLAGS) -DPAREN_MARKER -DEND_SPLIT_MARKER -c $(@:-esm-pm.o=.c) -o $@

$(ESM-PM-ERM-OBJ) : $(SRC) $(HDR)
	$(CC) $(CFLAGS) -DEND_REP_MARKER -DPAREN_MARKER -DEND_SPLIT_MARKER -c $(@:-esm-pm-erm.o=.c) -o $@

%.o : %.c %.h
	$(CC) $(CFLAGS) -c $<

clean :
	rm *.o

match.h

#ifndef __match_h
#define __match_h

#include "nfa.h"

struct List
{
  struct State **s;
  unsigned int n;
};

struct List *
list(unsigned int size)
{
  struct List *l;

  if ( (l = (struct List *) malloc(sizeof(struct List ))) == NULL ) {
    fprintf(stderr, "Could not allocate memory for state-list\n");
    /* Stop here: l is dereferenced below */
    exit(1);
  }
  if ( (l->s=(struct State **)malloc(sizeof(struct State *)*size)) == NULL ) {
    fprintf(stderr, "Could not allocate memory for state-list array\n");
    exit(1);
  }
  return l;
}

#endif

match.c

#include <stdio.h>
#include <string.h>
#include "util.h"
#include "nfa.h"
#include "match.h"

unsigned int stepid;

void
addstate(struct List *l, struct State *s, FILE *outstream)
{
  if(s == NULL)
    return;

  if(s->laststep == stepid){
    write_bit('b', outstream);
    return;
  }

  s->laststep = stepid;

  switch(s->type){
  case NFA_SPLIT:
    /* follow unlabeled arrows */
    write_bit('=', outstream);
    write_bit('0', outstream);
    addstate(l, s->out0, outstream);

    write_bit(':', outstream);
    write_bit('1', outstream);
    addstate(l, s->out1, outstream);

    return;
  case NFA_EPSILON:
    addstate(l, s->out0, outstream);
    return;
  }

  l->s[l->n] = s;
  l->n++;
  return;
}

enum Boolean
is_in_range(struct Range *r, unsigned int c, enum Boolean is_negated,
            FILE *outstream)
{
  struct Range *tmp;

  tmp = r;

  if(!is_negated){
    while(tmp != NULL){
      if(c >= tmp->lo && c <= tmp->hi){
        write_bit('\\', outstream);
        write_bit(c, outstream);
        return true;
      }
      tmp = tmp->next;
    }
    return false;
  }
  else {
    while(tmp != NULL){
      if(c >= tmp->lo && c <= tmp->hi){
        return false;
      }
      tmp = tmp->next;
    }
    write_bit('\\', outstream);
    write_bit(c, outstream);
    return true;
  }
}

void
step(struct List *clist, unsigned int c, struct List *nlist,
     FILE *outstream)
{
  unsigned int i;
  struct State *s;

  nlist->n = 0;

  for(i = 0; i < clist->n; i++){
    s = clist->s[i];
    if(i != 0)
      write_bit(':', outstream);

    if((s->type == NFA_LITERAL && s->c == c) ||
       (s->type == NFA_RANGE && is_in_range(s->range, c, s->is_negated,
                                            outstream)))
      addstate(nlist, s->out0, outstream);
    else
      write_bit('b', outstream);
  }
}


enum Boolean
last_addstate(struct State *s, FILE *outstream)
{
  enum Boolean result, result1;

  if(s == NULL)
    return false;

  if(s->laststep == stepid){
    write_bit('b', outstream);
    return false;
  }

  s->laststep = stepid;

  switch(s->type){
  case NFA_SPLIT:
    /* follow unlabeled arrows */
    write_bit('=', outstream);
    write_bit('0', outstream);
    result = last_addstate(s->out0, outstream);

    write_bit(':', outstream);
    write_bit('1', outstream);
    result1 = last_addstate(s->out1, outstream);

    return result || result1;
  case NFA_EPSILON:
    result = last_addstate(s->out0, outstream);
    return result;
  case NFA_ACCEPTING:
    write_bit('t', outstream);
    return true;
  }

  write_bit('b', outstream);
  return false;
}

enum Boolean
last_step(struct List *clist, unsigned int c, FILE *outstream)
{
  unsigned int i;
  struct State *s;
  enum Boolean result;

  result = false;
  for(i = 0; i < clist->n; i++){
    s = clist->s[i];
    if(i != 0)
      write_bit(':', outstream);

    if((s->type == NFA_LITERAL && s->c == c) ||
       (s->type == NFA_RANGE && is_in_range(s->range, c, s->is_negated,
                                            outstream)))
      result = last_addstate(s->out0, outstream) || result;
    else
      write_bit('b', outstream);
  }
  return result;
}

enum Boolean
match(struct NFA nfa, FILE *instream, FILE *outstream)
{
  unsigned int i;
  char cc, cn;
  enum Boolean result;

  struct List *clist, *nlist, *tmp;

  clist = list(nfa.statecount);
  clist->n = 0;
  nlist = list(nfa.statecount);

  stepid = 1;

  if((cc = read_bit(instream)) == EOF){
    result = last_addstate(nfa.start, outstream);
    return result;
  }

  addstate(clist, nfa.start, outstream);

  while ( (cn = read_bit(instream)) != EOF ){
    write_bit('|', outstream);
    stepid++;
    step(clist, cc, nlist, outstream);
    tmp = clist; clist = nlist; nlist = tmp;

    cc = cn;
  }

  write_bit('|', outstream);
  stepid++;
  result = last_step(clist, cc, outstream);
  return result;

}
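
The listing above relies on a per-step stamp (laststep compared against stepid) so that each state is added at most once per input character. The following is a minimal sketch of that idea only, under simplifying assumptions: struct Node, struct NodeList and add_once are hypothetical names that do not appear in the thesis code, and the sketch writes no mixed bit-values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified node; not the struct State of the listing. */
struct Node {
  unsigned int laststep;   /* step in which the node was last visited */
  struct Node *eps0;       /* epsilon successors, NULL if absent */
  struct Node *eps1;
};

struct NodeList {
  struct Node **items;     /* caller-provided array, one slot per node */
  unsigned int n;          /* number of slots in use */
};

/* Add s and everything reachable from it over epsilon edges,
 * visiting each node at most once per step. */
void
add_once(struct NodeList *l, struct Node *s, unsigned int stepid)
{
  if(s == NULL || s->laststep == stepid)
    return;                /* absent, or already visited in this step */

  s->laststep = stepid;

  if(s->eps0 != NULL || s->eps1 != NULL){
    /* Pure epsilon nodes are followed but not stored */
    add_once(l, s->eps0, stepid);
    add_once(l, s->eps1, stepid);
    return;
  }

  l->items[l->n++] = s;
}
```

Because the stamp is compared with the current step number, no flags need to be cleared between steps; incrementing the step number invalidates all old stamps at once.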

nfa.h

#ifndef __nfa_h
#define __nfa_h

#include "util.h"

/******************************
 * NFA states
 ******************************/

#define NFA_SPLIT 256
#define NFA_ACCEPTING 257
#define NFA_LITERAL 258
#define NFA_RANGE 259
#define NFA_EPSILON 260

/******************************
 * Subtypes
 ******************************/

#define NONE 0
#define LEFT_PAREN 1
#define RIGHT_PAREN 2
#define END_SPLIT 3
#define END_REPEAT 4
#define REPEAT 5
#define ALTERNATE 6

#define PAREN_CAPT 0
#define PAREN_NON_CAPT 1

struct Range
{
  unsigned int lo;
  unsigned int hi;

  struct Range *next;
};

enum Color
{
  white, gray, black
};

struct State
{
  // The type of the node: split, accepting, literal, epsilon or range
  unsigned int type;
  // If type is set to literal, this contains the value of the literal
  unsigned int c;
  // For freeing the nfa
  enum Boolean is_seen;
#if defined(PAREN_MARKER) || defined(END_SPLIT_MARKER) || defined (END_REP_MARKER)
  // The subtype of the transition: end_split
  int subtype;
  // If type is set to split or epsilon->end_split,
  // this contains the parenthesis count
  unsigned int parencount;
#endif
  // If type is set to range, this will show if the range is negated
  enum Boolean is_negated;
  // If the type is set to range, this contains the pointer to the
  // Range structure
  struct Range *range;
  // If the type is set to split, literal or range, this contains a
  // pointer to a following state
  struct State *out0;
  // If the type is set to split, this contains a pointer to a
  // following state
  struct State *out1;
  // Flags to ensure we only add each state once to the next list in
  // the simulation for each step
  unsigned int laststep;

  // For debugging
  Agnode_t *n;
  unsigned int id;
};

struct Range *range(unsigned int lo, unsigned int hi);

struct State *state(unsigned int c, unsigned int type,
                    struct State *s0, struct State *s1);

void nfa_free(struct State *s);

/******************************
 * NFA
 ******************************/
struct NFA {
  struct State *start;
  unsigned int statecount;
};

struct NFA re2nfa(const char *re, const unsigned int len);

#endif

nfa.c

#include <stdlib.h>

#include "nfa.h"

struct Range *
range(unsigned int lo, unsigned int hi)
{
  struct Range *r;

  if ( (r = (struct Range *) malloc(sizeof(struct Range))) == NULL ) {
    fprintf(stderr, "Error allocating memory for range");
    exit(1);
  }

  r->lo = lo;
  r->hi = hi;
  r->next = NULL;

  return r;
}

struct State *
state(unsigned int c, unsigned int type, struct State *s0,
      struct State *s1)
{
  struct State *s;

  if ( (s = (struct State *) malloc(sizeof(struct State))) == NULL ) {
    fprintf(stderr, "Error allocating memory for NFA state");
    exit(1);
  }

  s->c = c;
  s->type = type;
  s->is_negated = false;
  s->out0 = s0;
  s->out1 = s1;
  s->is_seen = false;
#ifdef END_SPLIT_MARKER
  s->parencount = 0;
  s->subtype = NONE;
#endif
  s->n = NULL;
  s->laststep = 0;

  return s;
}

void
range_free(struct Range *r)
{
  struct Range *tmp;

  while(r != NULL){
    tmp = r->next;
    free(r);
    r = tmp;
  }
}

void
nfa_free(struct State *s)
{
  if(s == NULL || s->is_seen)
    return;

  s->is_seen = true;

  switch(s->type){
  case NFA_SPLIT:
    nfa_free(s->out0);
    nfa_free(s->out1);
    break;
  case NFA_ACCEPTING:
    break;
  case NFA_RANGE:
    range_free(s->range);
    //fallthrough
  default:
    nfa_free(s->out0);
    break;
  }

  free(s);
}
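
The linked list built by range() is what the matcher walks to decide whether a character belongs to a (possibly negated) character class. As a minimal sketch of that membership test alone, assuming the same three-field struct Range as in nfa.h: the function name class_accepts is hypothetical, and unlike is_in_range in match.c it writes nothing to an output stream.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as struct Range in nfa.h */
struct Range {
  unsigned int lo;
  unsigned int hi;
  struct Range *next;
};

/* Returns 1 if the class accepts c, 0 otherwise.
 * A non-negated class accepts c when some range contains it;
 * a negated class accepts c when no range contains it. */
int
class_accepts(const struct Range *r, unsigned int c, int is_negated)
{
  for(; r != NULL; r = r->next){
    if(c >= r->lo && c <= r->hi)
      return !is_negated;    /* c lies in one of the ranges */
  }
  return is_negated != 0;    /* c lies in none of the ranges */
}
```

With the ranges a-c and x-x chained together, the call class_accepts(list, 'b', 0) accepts, while the same call with is_negated set to 1 rejects.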

parse.c

#include <stdio.h>
#include <string.h>
#include <assert.h>
#include "util.h"
#include "nfa.h"

#ifdef END_SPLIT_MARKER
unsigned int parendepth;
#endif

// Concatenate the top 2 fragments on the stack if possible
void
maybe_concat(struct Fragment **stackp, struct Fragment *stack)
{
  struct Fragment e1, e2;
  if(*stackp - stack >= 2 && (*stackp)[-1].op == OP_NO
     && (*stackp)[-2].op == OP_NO){
    e2 = *--(*stackp);
    e1 = *--(*stackp);
    ptrlist_patch(e1.out, e2.start);
    ptrlist_free(e1.out);

    *(*stackp)++ = fragment(e1.start, e2.out, OP_NO);
#ifdef END_SPLIT_MARKER
    (*stackp)[-1].parencount = e1.parencount + e2.parencount;
#endif
  }
}

// Alternate top fragments if possible
// Returns number of new states created
unsigned int
maybe_alternate(struct Fragment **stackp, struct Fragment *stack)
{
  struct State *s;
#ifdef END_SPLIT_MARKER
  struct State *s1;
#endif
  struct Fragment e1, e2;

  if(*stackp - stack >= 3 &&
     (*stackp)[-1].op == OP_NO &&
     (*stackp)[-2].op == OP_ALTERNATE &&
     (*stackp)[-3].op == OP_NO){
    e2 = *--(*stackp);
    // Just pop the alternate marker, no need to look at it
    --(*stackp);
    e1 = *--(*stackp);
    s = state(0, NFA_SPLIT, e1.start, e2.start);
#ifdef END_SPLIT_MARKER
    s->parencount = e1.parencount;

    s1 = state(0, NFA_EPSILON, NULL, NULL);
    s1->subtype = END_SPLIT;
    s1->parencount = e2.parencount;
    ptrlist_patch(e1.out, s1);

    *(*stackp)++ = fragment(s,
                            ptrlist_append(e2.out, ptrlist_list1(&s1->out0)),
                            OP_NO);
    (*stackp)[-1].parencount = e1.parencount + e2.parencount;
    return 2;
#else
    *(*stackp)++ = fragment(s, ptrlist_append(e2.out, e1.out), OP_NO);
    return 1;
#endif
  }
  if(*stackp - stack >= 2){
    if((*stackp)[-1].op == OP_ALTERNATE &&
       (*stackp)[-2].op == OP_NO){
      // Just pop the alternate marker, no need to look at it
      --(*stackp);
      e1 = *--(*stackp);

      s = state(0, NFA_SPLIT, e1.start, NULL);

      *(*stackp)++ = fragment(s,
                              ptrlist_append(e1.out, ptrlist_list1(&s->out1)),
                              OP_NO);
#ifdef END_SPLIT_MARKER
      s->parencount = e1.parencount;
      (*stackp)[-1].parencount = e1.parencount;
#endif
      return 1;
    }
    else if((*stackp)[-1].op == OP_NO &&
            (*stackp)[-2].op == OP_ALTERNATE){
      e1 = *--(*stackp);
      // Just pop the alternate marker, no need to look at it
      --(*stackp);
#ifdef END_SPLIT_MARKER
      if(e1.parencount > 0){
        s1 = state(0, NFA_EPSILON, NULL, NULL);
        s1->subtype = END_SPLIT;
        s1->parencount = e1.parencount;
        s = state(0, NFA_SPLIT, s1, e1.start);

        *(*stackp)++ = fragment(s,
                                ptrlist_append(e1.out, ptrlist_list1(&s1->out0)),
                                OP_NO);
        (*stackp)[-1].parencount = e1.parencount;
      }
      else {
        s = state(0, NFA_SPLIT, NULL, e1.start);

        *(*stackp)++ = fragment(s,
                                ptrlist_append(e1.out, ptrlist_list1(&s->out0)),
                                OP_NO);
        (*stackp)[-1].parencount = e1.parencount;
      }
      return 1;
#else
      s = state(0, NFA_SPLIT, NULL, e1.start);

      *(*stackp)++ = fragment(s,
                              ptrlist_append(e1.out, ptrlist_list1(&s->out0)),
                              OP_NO);
      return 1;
#endif
    }
  }
  if(*stackp - stack >= 1){
    // We are not rewriting "||" as "|" as this would change the
    // bit-values generated
    if((*stackp)[-1].op == OP_ALTERNATE){
      // Just pop alternate marker, no need to look at it
      --(*stackp);
      s = state(0, NFA_SPLIT, NULL, NULL);

      *(*stackp)++ = fragment(s,
                              ptrlist_append(ptrlist_list1(&s->out0),
                                             ptrlist_list1(&s->out1)),
                              OP_NO);
      return 1;
    }
  }
  return 0;
}

unsigned int
do_right_paren(struct Fragment **stackp, struct Fragment *stack)
{
  struct State *s;
  struct Fragment e1, e2;
  unsigned int result;

  maybe_concat(stackp, stack);
  result = maybe_alternate(stackp, stack);

  if(*stackp - stack >= 1){
    e1 = *--(*stackp);

    // Put in an epsilon edge if the parenthesis is empty
    if(e1.op == OP_LEFT_PAREN){
      s = state(0, NFA_EPSILON, NULL, NULL);
      result++;

      *(*stackp)++ = fragment(s, ptrlist_list1(&s->out0), OP_NO);
      return result;
    }
    else if(e1.op == OP_LEFT_CAPT_PAREN){
#ifdef PAREN_MARKER
      parendepth++;
      s = state(0, NFA_EPSILON, NULL, NULL);
      s->subtype = LEFT_PAREN;
      result++;

#endif
      s = state(0, NFA_EPSILON, NULL, NULL);
      result++;

#ifdef END_SPLIT_MARKER
      parendepth--;
      // Is this right?
      (*stackp)[-1].parencount = parendepth == 0? 1 : 0;
#endif

#ifdef PAREN_MARKER
      maybe_concat(stackp, stack);
      s = state(0, NFA_EPSILON, NULL, NULL);
      s->subtype = RIGHT_PAREN;
      result++;

      // It is now safe to concatenate the 2 parenthesis markers
      maybe_concat(stackp, stack);
      maybe_concat(stackp, stack);
#endif
      return result;
    }

    if(*stackp - stack >= 1){
      e2 = *--(*stackp);

      assert(e2.op == OP_LEFT_PAREN || e2.op == OP_LEFT_CAPT_PAREN);
#ifdef PAREN_MARKER
      if(e2.op == OP_LEFT_CAPT_PAREN){
        parendepth++;
        s = state(0, NFA_EPSILON, NULL, NULL);
        s->subtype = LEFT_PAREN;
        result++;

      }
#endif

      *(*stackp)++ = e1;
#ifdef END_SPLIT_MARKER
      if(e2.op == OP_LEFT_CAPT_PAREN){
        parendepth--;
        (*stackp)[-1].parencount = parendepth == 0? 1 : 0;
      }
#endif
#ifdef PAREN_MARKER
      if(e2.op == OP_LEFT_CAPT_PAREN){
        s = state(0, NFA_EPSILON, NULL, NULL);
        s->subtype = RIGHT_PAREN;
        result++;

        maybe_concat(stackp, stack);
        maybe_concat(stackp, stack);
      }
#endif
      return result;
    } else {
      fprintf(stderr, "Error: Unbalanced parenthesis (%i)\n", __LINE__);
      exit(1);
    }
  } else {
    fprintf(stderr, "Error: Unbalanced parenthesis (%i)\n", __LINE__);
    exit(1);
  }
}

unsigned int
parse_cc_char(const char *re, const unsigned int len,
              unsigned int *i)
{
  if(*i < len && re[(*i)] == '\\'){
    if(++(*i) < len)
      return re[(*i)++];
    else {
      fprintf(stderr, "Error: Bad escape at position %i (%i)\n", *i, __LINE__);
      exit(1);
    }
  }
  else
    return re[(*i)++];
}


struct Range *
parse_cc_range(const char *re, const unsigned int len,
               unsigned int *i)
{
  unsigned int lo, hi;

  lo = parse_cc_char(re, len, i);
  if((len-(*i)) >=2 && re[(*i)] == '-' && re[(*i)+1] != ']'){
    (*i)++;
    hi = parse_cc_char(re, len, i);
    if(hi < lo){
      fprintf(stderr, "Error: Bad character range at position %i (%i)\n", *i,
              __LINE__);
      exit(1);
    }
    return range(lo, hi);
  }
  else
    return range(lo, lo);
}

struct Fragment
cc2fragment(const char *re, const unsigned int len, unsigned int *i)
{
  enum Boolean is_negated, first;
  struct Range **r;
  struct Fragment e;
  struct State *s;

  // First char is a [, no need to see that
  (*i)++;

  if(*i >= len){
    fprintf(stderr, "Error: Missing right bracket in character class (%i)\n",
            __LINE__);
    exit(1);
  }

  // Is this character class negated?
  if(re[(*i)] == '^'){
    is_negated = true;
    (*i)++;
  } else
    is_negated = false;

  s = state(0, NFA_RANGE, NULL, NULL);
  s->is_negated = is_negated;
  r = &(s->range);
  e = fragment(s, ptrlist_list1(&s->out0), OP_NO);

  first = true;
  while(*i < len){
    if(re[(*i)] != ']' || first) {
      *r = parse_cc_range(re, len, i);
      r = &((*r)->next);
      first = false;
    }
    else
      return e;
  }

  fprintf(stderr, "Error: Missing right bracket in character class (%i)\n",
          __LINE__);
  exit(1);
}

struct NFA
finish_up_regex(struct Fragment **stackp, struct Fragment *stack,
                unsigned int statecount)
{
  struct State *s, *accept;
  struct Fragment e;
  struct NFA nfa;

  accept = state(0, NFA_ACCEPTING, NULL, NULL);
  statecount++;

  if(*stackp - stack == 0){
    nfa.start = accept;
    nfa.statecount = statecount;
    return nfa;
  }

  maybe_concat(stackp, stack);
  maybe_concat(stackp, stack);
  statecount += maybe_alternate(stackp, stack);

  if(*stackp - stack == 1){
    e = *--(*stackp);

    switch(e.op){
    case OP_ALTERNATE:
      s = state(0, NFA_SPLIT, NULL, NULL);
      statecount++;
      e = fragment(s, ptrlist_append(ptrlist_list1(&s->out0),
                                     ptrlist_list1(&s->out1)),
                   OP_NO);
      // Fallthrough
    case OP_NO:
      ptrlist_patch(e.out, accept);
      ptrlist_free(e.out);
      break;
    case OP_LEFT_PAREN:
      fprintf(stderr, "Error: Unbalanced parenthesis (%i)\n", __LINE__);
      exit(1);
    case OP_LEFT_CAPT_PAREN:
      fprintf(stderr, "Error: Unbalanced parenthesis (%i)\n", __LINE__);
      exit(1);
    default:
      fprintf(stderr, "Error: Bad regular expression (%i)\n", __LINE__);
      exit(1);
    }

    nfa.start = e.start;
    nfa.statecount = statecount;
    return nfa;
  }
  else {
    fprintf(stderr, "Error: Unbalanced parenthesis (%i)\n", __LINE__);
    exit(1);
  }
}

unsigned int
do_quantifier(struct Fragment **stackp, struct Fragment *stack,
              unsigned int quantifier)
{
  struct Fragment e;
  struct State *s;
#ifdef END_REP_MARKER
  struct State *s1;
#endif

  if(*stackp <= stack){
    fprintf(stderr,
            "Error: Quantifier follows nothing (%i)\n", __LINE__);
    exit(1);
  }

  e = *--(*stackp);

  if(e.op != OP_NO){
    fprintf(stderr,
            "Error: Quantifier follows nothing (%i)\n", __LINE__);
    exit(1);
  }

  s = state(0, NFA_SPLIT, e.start, NULL);
#ifdef END_REP_MARKER
  s1 = state(0, NFA_EPSILON, s, NULL);
  s1->subtype = END_REPEAT;
#endif

  switch(quantifier){
  case '*':
#ifdef END_REP_MARKER
    ptrlist_patch(e.out, s1);
#else
    ptrlist_patch(e.out, s);
#endif
    ptrlist_free(e.out);

    *(*stackp)++ = fragment(s, ptrlist_list1(&s->out1), OP_NO);
    break;
  case '?':
    *(*stackp)++ = fragment(s,
                            ptrlist_append(e.out, ptrlist_list1(&s->out1)),
                            OP_NO);
    break;
  case '+':
#ifdef END_REP_MARKER
    ptrlist_patch(e.out, s1);
#else
    ptrlist_patch(e.out, s);
#endif
    ptrlist_free(e.out);

    *(*stackp)++ = fragment(e.start, ptrlist_list1(&s->out1), OP_NO);
    break;
  }

  return 1;
}

unsigned int
read_paren_type(const char *re, const unsigned int len,
                unsigned int *i)
{
  if(*i+2 < len && re[*i+1] == '?' && re[*i+2] == ':'){
    *i += 2;
    return PAREN_NON_CAPT;
  }
  else
    return PAREN_CAPT;
}

struct NFA
re2nfa(const char *re, const unsigned int len)
{
  unsigned int i, statecount;
  struct State *s;
  struct Fragment *stackp;
  struct Fragment stack[len > 1? len : 1];

  stackp = stack;
  statecount = 0;
#ifdef END_SPLIT_MARKER
  parendepth = 0;
#endif

  for(i = 0; i < len; i++){
    switch(re[i]){
    case '*':
      // FALLTHROUGH
    case '?':
      // FALLTHROUGH
    case '+':
      do_quantifier(&stackp, stack, re[i]);
      break;
    case '|':
      maybe_concat(&stackp, stack);
      statecount += maybe_alternate(&stackp, stack);
      // Push new alternate operator onto stack
      *stackp++ = fragment(NULL, NULL, OP_ALTERNATE);
      break;
    case '(':
      maybe_concat(&stackp, stack);

      if(read_paren_type(re, len, &i) == PAREN_CAPT){
        *stackp++ = fragment(NULL, NULL, OP_LEFT_CAPT_PAREN);
      }
      else
        *stackp++ = fragment(NULL, NULL, OP_LEFT_PAREN);
      break;
    case ')':
      statecount += do_right_paren(&stackp, stack);
      break;
    case '[':
      maybe_concat(&stackp, stack);
      statecount++;
      *stackp++ = cc2fragment(re, len, &i);
      break;

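    case '.':
      maybe_concat(&stackp, stack);
      s = state(0, NFA_RANGE, NULL, NULL);
      // Count this state as is done for '[': range states are stored in
      // the simulation lists, which are sized by statecount
      statecount++;
      *stackp++ = fragment(s, ptrlist_list1(&s->out0), OP_NO);
      s->range = range(0, 255);
      break;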

    case '\\':
      i++;
      if(i >= len){
        fprintf(stderr, "Error: Bad escape at position %i (%i)\n",
                i, __LINE__);
        exit(1);
      }
      // FALLTHROUGH
    default:
      maybe_concat(&stackp, stack);
      s = state(re[i], NFA_LITERAL, NULL, NULL);
      statecount++;
      *stackp++ = fragment(s, ptrlist_list1(&s->out0), OP_NO);
      break;
    }
  }

  // Finish up the regex and splice in an accepting state
  return finish_up_regex(&stackp, stack, statecount);
}

serialize.c

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <getopt.h>
#include "nfa.h"
#include "util.h"

void
follow(struct State **s, FILE *outstream)
{
  while(true){
    assert((*s) != NULL);
    switch((*s)->type){
    case NFA_EPSILON:
      (*s) = (*s)->out0;
      break;
    case NFA_LITERAL:
      write_bit((*s)->c, outstream);
      (*s) = (*s)->out0;
      break;
    default:
      return;
    }
  }
}

void
read_bv(struct NFA nfa, FILE *instream, FILE *outstream)
{
  char c;
  struct State *cur;

  cur = nfa.start;

  while ( (c = read_bit(instream)) != EOF ){
    //printf("received char: %c\n", c);
    assert(cur != NULL);
    switch(c){
    case '0':
      cur = cur->out0;
      follow(&cur, outstream);
      break;
    case '1':
      cur = cur->out1;
      follow(&cur, outstream);
      break;
    case '\\':
      c = read_bit(instream);
      if(c == EOF){
        fprintf(stderr, "Error: Bad escape (%i)\n", __LINE__);
      }
      write_bit(c, outstream);
      cur = cur->out0;
      follow(&cur, outstream);
      break;
    default:
      fprintf(stderr, "Error: Bad character: %c %i (%i)\n",
              c, c, __LINE__);
      exit(0);
    }
  }
}


void
display_usage(void)
{
  puts("serialize - ");
  puts("usage: main [regex] [more options]");
  puts("OPTIONS:");
  puts("These are the long option names, any unique abbreviation is also accepted.");

  puts("--regular-expression=regex");
  puts("\tThe regular expression.");
  puts("--debug-file=file");
  puts("\tOptional, this is the file where debug output is dumped.");

  puts("--input-stream=file");
  puts("\tOptional, if present input will be read from file. Default is stdin.");
  puts("--regular-expression-file=file");
  puts("\tRead regular expression from file.");
  puts("--help");
  puts("\tWill print this message");

  exit(1);
}

int
main(const int argc, char* const argv[])
{
  int c, regexlen, option_index;
  struct NFA nfa;
  char *debugfile = NULL;
  char *regex;
  FILE *outstream = stdout;
  FILE *instream = stdin;
  char *outbuf;
  char *inbuf;

  static struct option long_options[] = {
    {"help", no_argument, NULL, 'h'},
    {"regular-expression", required_argument, NULL, 'r'},
    {"regular-expression-file", required_argument, NULL, 'a'},
    {"debug-file", required_argument, NULL, 'd'},
    {"output-stream", required_argument, NULL, 'o'},
    {"input-stream", required_argument, NULL, 'i'},
    {0, 0, 0, 0}
  };

  while(true){
    c = getopt_long(argc, argv, "hr:a:d:o:i:", long_options, &option_index);
    if(c == -1)
      break;









      }
      break;

      }
      break;
    default:
      break;
    }
  }


}



  print_nfa(debugfile, nfa.start);
  read_bv(nfa, instream, outstream);
  nfa_free(nfa.start);


}

trace.c

Optimized version.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <getopt.h>
#include "util.h"

struct Channel {
  int cnum;
  int nnum;

  char *bits;
  unsigned int bits_size;
  unsigned int bits_count;
};

void
channel_write_bit(struct Channel *chan, char bit)
{
  if(chan->bits_size <= chan->bits_count){
    chan->bits_size = chan->bits_size * 2;
    if ((chan->bits = (char *) realloc(chan->bits, chan->bits_size)) == NULL ) {
      perror("Error reallocating memory for bit values\n");
      exit(1);
    }
  }
  chan->bits[chan->bits_count++] = bit;
}

int
trace(char *mbv, unsigned int len, char **buf)
{
  int chan_cur = 0;
  unsigned int i, j, split_count = 0;
  enum Boolean first = true, even;
  struct Channel match = {0, 0, NULL, 0, 0};

  for(i = 0; i < len; i++){
    //printf("char: %c, i: %i, cur: %i, match.cnum: %i, match.nnum: %i, split_count: %i\n", mbv[i], i, chan_cur, match.cnum, match.nnum, split_count);
    even = true;
    for(j = i+1; j < len; j++){
      if(mbv[j] == '\\')
        even = !even;
      else
        break;
    }
    if(!even){
      if(i+1 < len && mbv[i+1] == '\\'){
        if(match.bits != NULL){
          if(chan_cur == match.cnum){
            channel_write_bit(&match, mbv[i]);
            channel_write_bit(&match, '\\');
          }
          else if((chan_cur - match.cnum) <= split_count){
            channel_write_bit(&match, mbv[i]);
            channel_write_bit(&match, '\\');
          }
        }
        i++;
        continue;
      }
    }

    switch(mbv[i]){
    case '|':
      match.cnum = match.nnum;
      split_count = 0;
      first = false;
      chan_cur = 0;
      break;
    case ':':
      split_count = 0;
      chan_cur++;
      break;
    case '=':
      if(match.bits != NULL){
        if(((chan_cur - match.cnum) <= split_count) ||
           (match.cnum >= chan_cur) || (match.cnum == -1)){
          match.nnum--;
        }
        if(match.cnum <= chan_cur)
          split_count++;
      }
      break;
    case 't':
      assert(match.bits == NULL);
      match.bits_size = 1024;
      if((match.bits = malloc(match.bits_size)) == NULL){
        perror("Can not allocate memory for bits\n");
        exit(1);
      }
      match.cnum = chan_cur;
      match.nnum += chan_cur;
      break;
    case 'b':
      if(!first) {
        if(match.bits != NULL){
          if(match.cnum >= chan_cur){
            match.cnum++;
            match.nnum++;
          }
        }
      }
      break;
    case '0':
      // Fallthrough
    case '1':
      if(match.bits != NULL){
        if(chan_cur == match.cnum){
          channel_write_bit(&match, mbv[i]);
        }
        else if((chan_cur - match.cnum) <= split_count){
          channel_write_bit(&match, mbv[i]);
          match.cnum = chan_cur;
        }
      }
      break;
    default:
      break;
    }
  }

  *buf = match.bits;
  return match.bits_count;
}


unsigned int
read_mbv(char **mbv, FILE *instream)
{
  char c;
  int size = 1024;
  int i = 0;

  if ((*mbv = (char *) malloc(size)) == NULL ) {
    perror("Error allocating memory for mixed bit values\n");
    exit(1);
  }

  while ( (c = read_bit(instream)) != EOF ){
    if(i >= size){
      size = size*2;
      if ((*mbv = (char *) realloc(*mbv, size)) == NULL ) {
        perror("Error reallocating memory for mixed bit values\n");
        exit(1);
      }
    }
    (*mbv)[i++] = c;
  }

  return i;
}

void
reverse(char *str, unsigned int len)
{
  char tmp;
  unsigned int i;

  for(i = 0; i < len / 2; ++i){
    tmp = str[i];
    str[i] = str[len-i-1];
    str[len-i-1] = tmp;
  }
}

void
display_usage(void)
{
  puts("trace - ");
  puts("usage: trace [options]");
  puts("OPTIONS:");
  puts("These are the long option names, any unique abbreviation is also accepted.");

  puts("--input-stream=file");
  puts("\tOptional, if present output will be read from file. Default is stdin.");
  puts("--help");
  puts("\tWill print this message");

  exit(1);
}


int
main(const int argc, char *const argv[])
{
  int c, option_index;
  unsigned int mbvlen, matchlen;
  char *mbv, *match;
  FILE *outstream = stdout;
  FILE *instream = stdin;
  char *outbuf;
  char *inbuf;
  unsigned int i;

  static struct option long_options[] = {
    {"help", no_argument, NULL, 'h'},
    {"output-stream", required_argument, NULL, 'o'},
    {"input-stream", required_argument, NULL, 'i'},
    {0, 0, 0, 0}
  };

  while(true){
    c = getopt_long(argc, argv, "ho:i:", long_options, &option_index);
    if(c == -1)
      break;

    switch(c){
    case 'h':
      display_usage();
      break;




      }
      break;

      }
      break;
    default:
      break;
    }
  }


  mbvlen = read_mbv(&mbv, instream);
  reverse(mbv, mbvlen);
  matchlen = trace(mbv, mbvlen, &match);
  if(match != NULL){
    reverse(match, matchlen);
    for(i = 0; i < matchlen; i++){
      write_bit(match[i], outstream);
    }
  }
  free(mbv);
  free(match);


}
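
Both channel_write_bit and read_mbv above grow their buffers by doubling the capacity whenever it is exhausted. A minimal sketch of that growth strategy in isolation follows. The names struct BitBuf and bitbuf_push are hypothetical, and two choices differ from the listing and are assumptions of this sketch: it starts from an empty buffer (capacity 0 grows to 16), and it reports allocation failure to the caller instead of exiting.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical buffer type, loosely modelled on struct Channel */
struct BitBuf {
  char *bits;
  size_t size;    /* allocated bytes */
  size_t count;   /* used bytes */
};

/* Append one character, doubling the capacity when it is full.
 * Returns 0 on success and -1 if memory could not be allocated. */
int
bitbuf_push(struct BitBuf *b, char bit)
{
  if(b->count >= b->size){
    size_t newsize = b->size ? b->size * 2 : 16;
    char *tmp = realloc(b->bits, newsize);
    if(tmp == NULL)
      return -1;            /* the old buffer is still valid */
    b->bits = tmp;
    b->size = newsize;
  }
  b->bits[b->count++] = bit;
  return 0;
}
```

Assigning the result of realloc to a temporary first keeps the original pointer available if the call fails, so the caller can still free it.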

trace2.c

Unoptimized version.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "util.h"

#define DEFAULT_BIT_SIZE 1000
#define DEFAULT_CHANNEL_COUNT 100

struct Channel {
  char *bits;
  char *b;
  unsigned int size;

  struct Channel *next;
};

struct Channel *
channel()
{

  }

  if ((new->bits = (char *) malloc(sizeof(char)*DEFAULT_BIT_SIZE)) == NULL ) {
    fprintf(stderr, "Error allocating memory for bits\n");
    exit(1);
  }

  new->size = DEFAULT_BIT_SIZE;
  new->b = new->bits;
  new->next = NULL;

  return new;
}

struct Channel *
channel_copy(struct Channel *old)
{
  struct Channel *new;
  unsigned int size;

  }

  if ((new->bits = (char *) malloc(sizeof(char)*old->size)) == NULL ) {
    fprintf(stderr, "Error allocating memory for bits\n");
    exit(1);
  }

  size = old->b - old->bits;
  memcpy(new->bits, old->bits, size);

  new->size = old->size;
  new->b = new->bits+size;
  new->next = old->next;

  return new;
}

void
channel_free(struct Channel *c)
{
  if(c != NULL){
    free(c->bits);
    free(c);
  }
}

void
channel_write_bit(struct Channel *c, char bit)
{
  char *tmp;

  // Extend the bit array?
  if(((unsigned int)(c->b - c->bits)) >= c->size){
    if ((tmp = (char *) realloc(c->bits, sizeof(char)*c->size*2)) == NULL ) {
      fprintf(stderr, "Error allocating memory for bits\n");
      exit(1);
    }
    c->size = c->size * 2;
    if(tmp != c->bits){
      c->b = (c->b - c->bits) + tmp;
      c->bits = tmp;
    }
  }

  *c->b++ = bit;
}

// Inserts element cur in list nlist, where new points to the last
// element. If cur is NULL, nothing happens
void
channel_append(struct Channel **nlist, struct Channel **new,
               struct Channel **cur)
{
  if(*cur == NULL)
    return;

  // Insert element in empty list
  if(*new == NULL){
    *nlist = *cur;
    *new = *cur;
    (*cur)->next = NULL;
  } else {
    // Insert element in non-empty list
    (*cur)->next = NULL;
    (*new)->next = *cur;
    *new = *cur;
  }
}

// Inserts element cur in list nlist, where new points to the last
// element and advances cur to next. If cur is NULL, then nothing happens
void
channel_swap(struct Channel **nlist, struct Channel **new,
             struct Channel **cur)
{
  struct Channel *tmp;

  if(*cur == NULL)
    return;

  tmp = *cur;

  // Advance old list
  *cur = (*cur)->next;
  channel_append(nlist, new, &tmp);
}

void
print_first(struct Channel *mlist)
{
  char *p;

  for(p = mlist->bits; p != mlist->b; p++){
    write_bit(*p, stdout);
  }
}


void
read_mbv()
{
  char c;
  // b indicates whether channels should be swapped on channel change
  // (some meta characters, like t and b, cause an automatic channel swap)
  int b;
  // Channels for next iteration is stored in nlist, where
  // nlist is the head of the list and new is the last element
  // matches are stored in the mlist, where
  // mlist is the head of the list and match is the last element
  struct Channel *nlist, *new, *tmp, *cur, *mlist, *match;

  b = 0;

  nlist = NULL;
  new = NULL;
  mlist = NULL;
  match = NULL;
  cur = channel();

  while ( (c = read_bit(stdin)) != EOF ){
    //printf("received char: %c\n", c);
    switch(c){
    case '|':
      if(b == 0)
        channel_swap(&nlist, &new, &cur);
      else
        b = 0;

      if(cur != NULL){
        fprintf(stderr, "Error: Channel corruption (%i)\n",
                __LINE__);
        exit(0);
      }

      cur = nlist;
      nlist = NULL;
      new = NULL;
      break;
    case ':':
      if(b == 0)
        channel_swap(&nlist, &new, &cur);
      else
        b = 0;
      break;
    case '=':
      b = 0;
      if(cur == NULL){
        fprintf(stderr, "Error: Channel corruption (%i)\n",
                __LINE__);
        exit(0);
      }
      tmp = channel_copy(cur);
      cur->next = tmp;
      break;
    case 't':
      b = 1;
      if(cur == NULL){
        fprintf(stderr, "Error: Channel corruption (%i)\n",
                __LINE__);
        exit(0);
      }
      channel_swap(&mlist, &match, &cur);
      break;
    case 'b':
      b = 1;

      if(cur == NULL){
        fprintf(stderr, "Error: Channel corruption (%i)\n",
                __LINE__);
        exit(0);
      }

      tmp = cur;
      cur = cur->next;
      channel_free(tmp);
      break;
    case '0':
      b = 0;
      if(cur == NULL){
        fprintf(stderr, "Error: Channel corruption (%i)\n",
                __LINE__);
        exit(0);
      }

      channel_write_bit(cur, '0');
      break;
    case '1':
      b = 0;
      if(cur == NULL){
        fprintf(stderr, "Error: Channel corruption (%i)\n",
                __LINE__);
        exit(0);
      }

      channel_write_bit(cur, '1');
      break;
    case '\\':
      c = read_bit(stdin);
      if(c == EOF){
        fprintf(stderr, "Error: Bad escape (%i)\n", __LINE__);
      }
      channel_write_bit(cur, '\\');
      channel_write_bit(cur, c);
      break;
    case '*':
      break;
    default:
      fprintf(stderr, "Error: Bad character (%i)\n",
              __LINE__);
      exit(0);
    }
  }
  if(mlist != NULL)
    print_first(mlist);

  while(mlist != NULL){
    tmp = mlist->next;
    channel_free(mlist);
    mlist = tmp;
  }

  while(nlist != NULL){
    tmp = nlist->next;
    channel_free(nlist);
    nlist = tmp;
  }
}

int
main(void)
{
  FILE *outstream = stdout;
  FILE *instream = stdin;
  char *outbuf;
  char *inbuf;

  read_mbv();

  close_stream(outbuf, outstream);
  close_stream(inbuf, instream);

  return 0;
}

util.h

#ifndef __util_h
#define __util_h

#include <stdlib.h>
#include <graphviz/gvc.h>

enum Boolean{
    false, /* false = 0, true = 1 */
    true
};

int read_file(char *filename, char **buf);

char *init_stream(unsigned int buf_size, FILE *stream);
void close_stream(char *buf, FILE *stream);

void write_bit(char bit, FILE *stream);
char read_bit(FILE *stream);

/******************************
 * Operators
 ******************************/

#define OP_NO 0


#define OP_ALTERNATE 258
#define OP_LEFT_PAREN 259
#define OP_LEFT_CAPT_PAREN 260

/******************************
 * Linked list of pointers
 * to NFA states
 ******************************/

struct Statelist_elem{
    struct State **outp;
    struct Statelist_elem *next;
};

struct Statelist{
    struct Statelist_elem *first;
    struct Statelist_elem *last;
};

struct Statelist *ptrlist_list1(struct State **outp);

// Patches up the dangling pointers in l so they point to s
void ptrlist_patch(struct Statelist *l, struct State *s);

struct Statelist *ptrlist_append(struct Statelist *l1,
                                 struct Statelist *l2);

void ptrlist_free(struct Statelist *l);

/******************************
 * NFA fragments
 ******************************/

struct Fragment{
    unsigned int op;
#ifdef END_SPLIT_MARKER
    unsigned int parencount;
#endif
    struct State *start;
    struct Statelist *out;
};

struct Fragment fragment(struct State *s, struct Statelist *p, int op);


/******************************
 * Stack of NFA fragments
 ******************************/

#endif
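A note on the `read_bit` declaration above: `fgetc` returns an `int` so that `EOF` can be told apart from every valid byte. When the result is narrowed to `char`, a `0xFF` data byte can compare equal to `EOF` on platforms where `char` is signed, and `EOF` may never compare equal where `char` is unsigned. The following is a minimal sketch of an `int`-returning variant. The names `read_bit_int` and `count_bytes` are hypothetical and do not appear in the thesis sources.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical variant of read_bit (not part of the thesis sources):
 * returns the next byte as an unsigned char converted to int, or EOF
 * at end of input or on a read error. */
int read_bit_int(FILE *stream)
{
    return fgetc(stream);
}

/* Counts the bytes remaining in a stream. Each result is kept in an
 * int so that a 0xFF data byte is not mistaken for EOF. */
long count_bytes(FILE *stream)
{
    long n = 0;
    int c; /* int, not char: must be able to hold EOF */

    assert(stream != NULL);
    while ((c = read_bit_int(stream)) != EOF)
        n++;
    return n;
}
```

A caller in the style of `read_mbv` would declare its loop variable as `int` and only pass it on as a `char` once the `EOF` comparison has been made.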

util.c

#include "util.h"
#include <stdio.h>
#include <unistd.h>

char *
init_stream(unsigned int buf_size, FILE *stream){
    char *buf = NULL;

    /* if(buf_size == 0){ */
    /*     if(setvbuf(stream, NULL, _IONBF, 0) != 0){ */
    /*         perror("Failed to set unbuffered IO\n"); */
    /*         exit(1); */
    /*     } */
    /*     buf = NULL; */
    /* } */
    /* else { */
    /*     if((buf = malloc(buf_size)) == NULL){ */
    /*         perror("Can not allocate memory for IO buffering\n"); */
    /*         exit(1); */
    /*     } */
    /*     if(setvbuf(stream, buf, _IOFBF, buf_size) != 0){ */
    /*         perror("Failed to set buffer for IO\n"); */
    /*         exit(1); */
    /*     } */
    /* } */

    return buf;
}

void
close_stream(char *buf, FILE *stream){
    fclose(stream);
    free(buf);
}

void
write_bit(char bit, FILE *stream){
    fputc(bit, stream);


}

char
read_bit(FILE *stream){
    return fgetc(stream);
}

int
read_file(char *filename, char **buf){
    FILE *fp;
    int buf_size = 1024;
    ssize_t bytes_read = 0;
    ssize_t n;

    if((fp = fopen(filename, "r")) == NULL){
        perror("Can not open file for reading in read_file\n");
        exit(1);
    }

    if((*buf = malloc(buf_size)) == NULL){
        perror("Can not allocate memory for data in read_file\n");
        exit(1);
    }

    while(true){
        /* Check the result of read() before adding it to the total,
           so that a failed read (-1) is detected on every iteration */
        n = read(fileno(fp), (*buf) + bytes_read,
                 buf_size - bytes_read);
        if(n == -1){
            perror("Could not read data from file in read_file\n");
            exit(1);
        }
        bytes_read += n;
        if(bytes_read >= buf_size){
            /* Buffer is full: double it and keep reading */
            buf_size = buf_size*2;
            if((*buf = realloc(*buf, buf_size)) == NULL){
                perror("Can not reallocate memory for data in read_file\n");
                exit(1);
            }
        } else {
            break;
        }
    }

    fclose(fp);
    return bytes_read;
}


struct Fragment
fragment(struct State *s, struct Statelist *p, int op){
    struct Fragment f;

    f.start = s;
    f.out = p;
    f.op = op;
#ifdef END_SPLIT_MARKER
    f.parencount = 0;
#endif

    return f;
}

/******************************
 * Linked list of pointers
 * to NFA states
 ******************************/

struct Statelist *
ptrlist_list1(struct State **outp){
    struct Statelist *p;
    struct Statelist_elem *e;

    if ( (p = (struct Statelist *) malloc(sizeof(struct Statelist))) == NULL ) {
        printf("Error allocating memory for state list");
        exit(1);
    }

    if ( (e = (struct Statelist_elem *)
              malloc(sizeof(struct Statelist_elem))) == NULL ) {
        printf("Error allocating memory for state list");
        exit(1);
    }

    e->outp = outp;
    e->next = NULL;

    p->first = e;
    p->last = e;


    return p;
}

void
ptrlist_patch(struct Statelist *l, struct State *s){
    struct Statelist_elem *e;

    e = l->first;
    while(e != NULL){
        *(e->outp) = s;
        e = e->next;
    }
}

struct Statelist *
ptrlist_append(struct Statelist *l1, struct Statelist *l2){
    if(l1 == NULL)
        return l2;

    if(l1 == l2) {
        return l1;
    }

    l1->last->next = l2->first;
    l1->last = l2->last;
    free(l2);

    return l1;
}

void
ptrlist_free(struct Statelist *s){
    struct Statelist_elem *e, *temp;

    e = s->first;
    while(e != NULL){
        temp = e;
        e = e->next;
        free(temp);
    }

    free(s);
}
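The pointer-list functions above can be exercised in isolation. The following is a minimal sketch, assuming a simplified `struct State` with two out-pointers; the real `struct State` is defined elsewhere in the sources and may differ. The helper names `list1`, `append`, `patch`, `list_free` and `demo_patch` are hypothetical, condensed stand-ins for `ptrlist_list1`, `ptrlist_append`, `ptrlist_patch` and `ptrlist_free`. The sketch collects the two dangling out-pointers of a split state in one list and patches both to the same target state.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed, simplified NFA state; the thesis' struct State is defined
 * elsewhere and is not reproduced here. */
struct State {
    int c;
    struct State *out;
    struct State *out1;
};

struct Statelist_elem {
    struct State **outp;
    struct Statelist_elem *next;
};

struct Statelist {
    struct Statelist_elem *first;
    struct Statelist_elem *last;
};

/* Creates a one-element list holding the dangling pointer outp */
static struct Statelist *list1(struct State **outp)
{
    struct Statelist *p = malloc(sizeof *p);
    struct Statelist_elem *e = malloc(sizeof *e);

    if (p == NULL || e == NULL)
        exit(1);
    e->outp = outp;
    e->next = NULL;
    p->first = p->last = e;
    return p;
}

/* Links l2's elements onto the end of l1 and frees l2's header */
static struct Statelist *append(struct Statelist *l1, struct Statelist *l2)
{
    l1->last->next = l2->first;
    l1->last = l2->last;
    free(l2);
    return l1;
}

/* Points every dangling pointer recorded in l at state s */
static void patch(struct Statelist *l, struct State *s)
{
    struct Statelist_elem *e;

    for (e = l->first; e != NULL; e = e->next)
        *(e->outp) = s;
}

/* Frees the list elements and the list header, not the states */
static void list_free(struct Statelist *l)
{
    struct Statelist_elem *e = l->first;

    while (e != NULL) {
        struct Statelist_elem *next = e->next;
        free(e);
        e = next;
    }
    free(l);
}

/* Returns 1 if both out-pointers of the split state end up pointing
 * at the target state after patching, 0 otherwise. */
int demo_patch(void)
{
    struct State split = { 256, NULL, NULL }; /* 256: arbitrary marker value */
    struct State target = { 'a', NULL, NULL };
    struct Statelist *l;
    int ok;

    l = append(list1(&split.out), list1(&split.out1));
    patch(l, &target);
    ok = (split.out == &target && split.out1 == &target);
    list_free(l);
    return ok;
}
```

Because the list stores the addresses of the pointer fields rather than the states themselves, a fragment's exits can be connected after the fragment has been built, without searching the automaton for them.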
