Cs419 lec3 lexical analysis using re

Compilers WELCOME TO A JOURNEY TO CS419 Lecture 3 Scanning (Lexical Analysis) using Regular expressions Cairo University FCI Dr. Hussien Sharaf Computer Science Department [email protected]


Page 2: Cs419 lec3   lexical analysis using re


PART ONE

Dr. Hussien M. Sharaf

Page 3: Cs419 lec3   lexical analysis using re

SCANNING

A scanner reads a stream of characters and puts them together into meaningful units (with respect to the source language) called tokens.

It produces a stream of tokens for the next phase of the compiler.

Stream of characters → scanner → Stream of tokens

Page 4: Cs419 lec3   lexical analysis using re

WHAT IS LEXICAL ANALYSIS

Lexical Analysis (Scanning) is the task of reading a stream of input characters and dividing them into tokens (words) that belong to a certain language.

Responsibility: accepts and splits a stream into tokens according to rules defined by the source code language.

Example: the scanner splits the stream a[index]=4+2; into tokens

a        Identifier
[        left bracket
index    Identifier
]        right bracket
=        assignment
4        Number
+        operator
2        Number

Page 5: Cs419 lec3   lexical analysis using re

REGULAR EXPRESSIONS

RE is a language for describing simple languages and patterns.

Algorithm for using RE
1. Define a pattern: S1 StudentName S2
2. Loop
3. Find next pattern
4. Store StudentName into DB or encrypt StudentName
5. Until no match

REs are used in applications that involve file parsing and text matching.

Many implementations of RE exist.
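The loop above can be sketched in Python. This is only an illustration: the delimiters "<name>" and "</name>" are hypothetical stand-ins for S1 and S2, and collecting the names in a list stands in for "store into DB".

```python
import re

# Hypothetical delimiters standing in for S1 and S2 on the slide.
S1, S2 = "<name>", "</name>"

# Step 1: define the pattern S1 StudentName S2 (StudentName is captured).
PATTERN = re.compile(re.escape(S1) + r"(.*?)" + re.escape(S2))

def extract_names(text):
    """Steps 2-5: find every match in order and keep the captured StudentName."""
    return [match.group(1) for match in PATTERN.finditer(text)]
```

For example, extract_names("<name>Ali</name> and <name>Mona</name>") returns the two captured names in the order they appear.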

Page 6: Cs419 lec3   lexical analysis using re

HOW CAN RE SERVE IN LEXICAL ANALYSIS?

Algorithm for using RE in LA
1. Define a pattern for each valid statement
2. Loop
3. Find next pattern and call it Token
4. Parse(Token)
5. Until no match
6. If not at the end of the stream (EOF) then return Error, else return Success
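A minimal Python sketch of this algorithm, applied to the earlier input a[index]=4+2; is shown here. The token names and patterns are assumptions chosen for illustration, not definitions from the lecture, and raising ValueError plays the role of "return Error".

```python
import re

# Illustrative token patterns (assumed, not taken from the lecture).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("LEFT_BRACKET", r"\["),
    ("RIGHT_BRACKET", r"\]"),
    ("ASSIGNMENT", r"="),
    ("OPERATOR", r"[+\-*/]"),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),
]
# One alternation of named groups; the group that matched names the token.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(text):
    """Return a list of (token_type, lexeme); raise ValueError if input is left over."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            # No pattern matches although the stream has not ended: Error.
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Reaching the end of the loop without an exception corresponds to "return Success": every character of the stream was consumed by some token pattern.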

Page 7: Cs419 lec3   lexical analysis using re

RE IN C++ BOOST LIBRARY

// Pattern matching in a string
// Created by Flavio Castelli <[email protected]>
// distributed under GPL v2 license
#include <boost/regex.hpp>
#include <cstdio>
#include <string>

int main() {
    // Case-insensitive pattern in Perl syntax
    boost::regex pattern("Ali", boost::regex_constants::icase | boost::regex_constants::perl);
    std::string stringa("Searching for Ali, Is there another Ali? Yes there is another Ali");
    int count = 0;
    // Resume each search where the previous match ended; searching the
    // whole string again on every pass would never terminate.
    std::string::const_iterator start = stringa.begin();
    std::string::const_iterator end = stringa.end();
    boost::smatch match;
    while (boost::regex_search(start, end, match, pattern)) {
        count++;
        start = match[0].second;
    }
    printf("%d found\n", count);
    return 0;
}

Output is: "3 found"

Please check [http://flavio.castelli.name/2006/02/16/regexp-with-boost/]

Page 8: Cs419 lec3   lexical analysis using re

RE IN C++ STL NEEDS EXTRA FEATURE PACK

#include <regex>
#include <iostream>
#include <string>

bool is_email_valid(const std::string& email) {
    // define a regular expression
    const std::tr1::regex pattern("(\\w+)(\\.|_)?(\\w*)@(\\w+)(\\.(\\w+))+");
    // try to match the string with the regular expression
    return std::tr1::regex_match(email, pattern);
}

Please check http://www.codeguru.com/cpp/cpp/cpp_mfc/stl/article.php/c15339
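For comparison, the same check can be sketched with Python's re module. The pattern is copied from the C++ example; re.fullmatch is used here as the assumed counterpart of regex_match, since both require the whole string to match.

```python
import re

# Same pattern as the C++ example, written as a Python raw string.
EMAIL_PATTERN = re.compile(r"(\w+)(\.|_)?(\w*)@(\w+)(\.(\w+))+")

def is_email_valid(email):
    """Return True when the whole string matches the pattern."""
    return EMAIL_PATTERN.fullmatch(email) is not None
```

This pattern is a teaching example and accepts only a simple family of addresses; it is not a complete description of valid email syntax.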

Page 9: Cs419 lec3   lexical analysis using re

RE IN C++ STL (CONT.)

std::string str("abc + (inside brackets)dfsd");
std::smatch m;
std::regex_search(str, m, std::regex("b"));
if (m[0].matched)
    std::cout << "Found " << m.str(0) << '\n';
else
    std::cout << "Not found\n";

Please check http://www.codeguru.com/cpp/cpp/cpp_mfc/stl/article.php/c15339

Page 10: Cs419 lec3   lexical analysis using re

RE IN PYTHON

Example for Python RE

import re

line = "Cats are smarter than dogs"
# (\.*) matches zero or more literal dots, so group(2) is empty for this line
matchObj = re.search(r'(.*) are(\.*)', line, re.M | re.I)
if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")

Please check http://www.tutorialspoint.com/python/python_reg_expressions.htm

Some notations of Python RE

\d  Matches any decimal digit; this is equivalent to the class [0-9].
\D  Matches any non-digit character; this is equivalent to the class [^0-9].
\s  Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S  Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w  Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W  Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
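A short sketch of these classes in use follows. The sample string is made up for illustration, and the re.ASCII flag is passed so that the classes stay limited to the ASCII sets listed above.

```python
import re

sample = "Room 42, floor_3!"  # made-up sample text

# re.ASCII keeps \d, \s and \w limited to the ASCII classes listed above.
digits = re.findall(r"\d", sample, re.ASCII)          # single digits
non_spaces = re.findall(r"\S+", sample, re.ASCII)     # runs of non-whitespace
words = re.findall(r"\w+", sample, re.ASCII)          # runs of [a-zA-Z0-9_]
non_word_chars = re.findall(r"\W", sample, re.ASCII)  # everything else

print(digits, non_spaces, words, non_word_chars)
```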

Page 11: Cs419 lec3   lexical analysis using re

OTHER CODE SAMPLES

Other samples can be checked at:

Boost Library
1. http://stackoverflow.com/questions/5804453/c-regular-expressions-with-boost-regex
2. http://www.codeproject.com/KB/string/regex__.aspx
3. http://www.boost.org/doc/libs/1_43_0/more/getting_started/windows.html#build-from-the-visual-studio-ide

STL Library
1. http://www.codeproject.com/KB/recipes/rexsearch.aspx
2. http://www.codeguru.com/cpp/cpp/cpp_mfc/stl/article.php/c15339 [recommended]
3. http://www.microsoft.com/download/en/confirmation.aspx?id=6922 [feature pack VS2008 Pack]
4. http://channel9.msdn.com/Shows/Going+Deep/C9-Lectures-Stephan-T-Lavavej-Standard-Template-Library-STL-8-of-n [Video]

Page 12: Cs419 lec3   lexical analysis using re

PART TWO

Page 13: Cs419 lec3   lexical analysis using re

DEFINITION OF A LANGUAGE "L"

∑ is the alphabet, a set of letters.

1. Let ∑ = {x}. Then L = ∑*. In RE, we write x*
2. Let ∑ = {x, y}. Then L = ∑*. In RE, we write (x+y)*
3. Kleene's star "*" means any combination of letters of length zero or more.

Note: Please study the appendix at the end of the lecture
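Since ∑* is infinite, a program can only list it up to some length. The sketch below does that; the function name kleene_star and the length bound are illustrative choices, and the empty Python string stands for the empty word.

```python
from itertools import product

def kleene_star(alphabet, max_length):
    """List the words of the alphabet's Kleene star up to max_length, shortest first."""
    words = []
    for length in range(max_length + 1):
        # product(..., repeat=0) yields one empty tuple, i.e. the empty word.
        for letters in product(sorted(alphabet), repeat=length):
            words.append("".join(letters))
    return words

print(kleene_star({"x"}, 3))
print(kleene_star({"x", "y"}, 2))
```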

Page 14: Cs419 lec3   lexical analysis using re

RE RULES

Given an alphabet ∑, the set of regular expressions is defined by the following rules.

1. For every letter in ∑, the letter written in bold is a regular expression. Λ (lambda) or ε (epsilon) is a regular expression. Λ or ε means an empty letter.
2. If r1 and r2 are regular expressions, then so are:
   1. (r1)
   2. r1 r2
   3. r1+r2
   4. r1* and also r2*
   NOTE 1: r1^+ (superscript plus) is not a RE
   NOTE 2: spaces are ignored as long as they are not included in ∑
3. Nothing else is a regular expression.

Page 15: Cs419 lec3   lexical analysis using re

RE-1

Example 1
∑ = {a, b}
Formally describe all words with a followed by any number of b's.
L = a b* = ab*
Give examples for words in L
{a ab abb abbb …..}

Page 16: Cs419 lec3   lexical analysis using re

RE-2

Example 2
∑ = {a, b}
Formally describe the language that contains the empty word Λ and all words in which every a is followed by one or more b's.
L = b*(abb*)* OR (b+ab)*
Give examples for words in L
{Λ ab abb abababb ….. b bbb …}
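The claim that both expressions describe the same language can be spot-checked by brute force. The sketch below compares them on every word over {a, b} up to a chosen length; note that the union written "+" in the lecture is "|" in Python's re syntax. A bounded check like this is evidence, not a proof.

```python
import re
from itertools import product

FIRST = re.compile(r"b*(abb*)*")
SECOND = re.compile(r"(b|ab)*")  # the lecture's (b+ab)*

def words_up_to(alphabet, max_length):
    """Yield every word over the alphabet with length 0..max_length."""
    for length in range(max_length + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)

def disagreements(max_length):
    """Words accepted by exactly one of the two expressions."""
    return [
        word for word in words_up_to("ab", max_length)
        if (FIRST.fullmatch(word) is None) != (SECOND.fullmatch(word) is None)
    ]

print(disagreements(8))
```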

Page 17: Cs419 lec3   lexical analysis using re

RE-3

Example
∑ = {a, b}
Formally describe all words with a followed by one or more b's.
L = a b^+ = abb*
Give examples for words in L
{ab abb abbb …..}

Page 18: Cs419 lec3   lexical analysis using re

RE-4

Example
∑ = {a, b, c}
Formally describe all words that start with an a followed by any number of b's and then end with c.
L = a b*c
Give examples for words in L
{ac abc abbc abbbc …..}

Page 19: Cs419 lec3   lexical analysis using re

RE-5

Example
∑ = {a, b}
Formally describe all words where a's, if any, come before b's, if any.
L = a* b*
Give examples for words in L
{Λ a b aa ab bb aaa abb abbb bbb …..}
NOTE: a* b* ≠ (ab)* because the first language does not contain abab but the second language does. Once a single b is detected, no more a's can be added.
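The difference between the two languages can be checked directly with Python's re module. The helper name in_language is an illustrative choice; fullmatch is used so that the whole word must belong to the language.

```python
import re

A_THEN_B = re.compile(r"a*b*")
AB_REPEATED = re.compile(r"(ab)*")

def in_language(pattern, word):
    """True when the entire word matches the pattern."""
    return pattern.fullmatch(word) is not None

print(in_language(A_THEN_B, "abab"), in_language(AB_REPEATED, "abab"))
```

Here abab is rejected by a*b* and accepted by (ab)*, while a word such as aabbb is accepted by a*b* only.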

Page 20: Cs419 lec3   lexical analysis using re

RE-6

Example
∑ = {a}
Formally describe all words where the count of a is odd.
L = a(aa)* OR (aa)*a
Give examples for words in L
{a aaa aaaaa …..}

Page 21: Cs419 lec3   lexical analysis using re

RE-7.1

Example
∑ = {a, b, c}
Formally describe all words where a single a or c comes at the start, then an odd number of b's.
L = (a+c)b(bb)*
Give examples for words in L
{ab cb abbb cbbb …..}

Page 22: Cs419 lec3   lexical analysis using re

RE-7.2

Example
∑ = {a, b, c}
Formally describe all words where a single a or c comes at the start, then an odd number of b's in the case of a, and zero or an even number of b's in the case of c.
L = ab(bb)* + c(bb)*
Give examples for words in L
{ab c abbb cbb abbbbb …..}

Page 23: Cs419 lec3   lexical analysis using re

RE-8

Example
∑ = {a, b, c}
Formally describe all words where one or more a's or one or more c's come at the start, then one or more b's.
L = (a^+ + c^+) b^+ = (aa* + cc*) bb*
Give examples for words in L
{ab cb aabb cbbb …..}

Page 24: Cs419 lec3   lexical analysis using re

RE-9

Example
∑ = {a, b}
Formally describe all words with length three.
L = (a+b)^3 = (a+b)(a+b)(a+b)
List all words in L
{aaa aab aba baa abb bab bba bbb}
What is the count of words of length 4? 16 = 2^4
What is the count of words of length 44? 2^44
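These counts can be confirmed by enumeration for small lengths. The sketch below lists all words of a given length with itertools.product; the function name words_of_length is an illustrative choice. Length 44 is far too large to enumerate, so only the formula applies there.

```python
from itertools import product

def words_of_length(alphabet, n):
    """All words of exactly n letters over the alphabet."""
    return ["".join(letters) for letters in product(alphabet, repeat=n)]

three = words_of_length("ab", 3)
print(three, len(three))               # 8 words, matching the list above
print(len(words_of_length("ab", 4)))   # 2**4
```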

Page 25: Cs419 lec3   lexical analysis using re

ASSIGNMENTS

1. [One week] Solve Compilers-Sheet_0-General.pdf
2. [One week] Solve Compilers_Sheet_1_RE.pdf
3. [Two weeks] Implement an email validator using Python.
   a. Build a GUI using Python Tkinter.
   b. The GUI must include at least two labels, a TextBox and a Button.
   c. The user is given a message on a label stating whether the email is a valid email or not.

Page 26: Cs419 lec3   lexical analysis using re

DEADLINE

Thursday 28th of Feb is the deadline. No excuses. Don't wait until the last day.

TAs can help you to the highest limit within the next 3 days.

Page 27: Cs419 lec3   lexical analysis using re

APPENDIX


Page 28: Cs419 lec3   lexical analysis using re

APPENDIX: MATHEMATICAL NOTATIONS AND TERMINOLOGY

Page 29: Cs419 lec3   lexical analysis using re

SETS-[1]

A set is a group of elements represented as a unit.

For example, S = {a, b, c} is a set of 3 elements.

Elements are included in curly brackets { }.

a ∈ S: a belongs to S
f ∉ S: f does NOT belong to S

Page 30: Cs419 lec3   lexical analysis using re

SETS-[2]

If S = {a, b, c} and T = {a, b}, then:

T ⊆ S: T is a subset of S
T ∩ S = {a, b}: T intersects S = {a, b}
T ∪ S = {a, b, c}: T union S = {a, b, c}

[Figure: Venn diagram for S and T]
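The same relations can be tried with Python's built-in set type, using the S and T from this slide. The letters are written as one-character strings.

```python
S = {"a", "b", "c"}
T = {"a", "b"}

belongs = "a" in S        # a belongs to S
not_in = "f" not in S     # f does not belong to S
is_subset = T <= S        # T is a subset of S
intersection = T & S      # elements in both T and S
union = T | S             # elements in T or S

print(belongs, not_in, is_subset, sorted(intersection), sorted(union))
```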

Page 31: Cs419 lec3   lexical analysis using re

SEQUENCES AND TUPLES

A sequence is a list of elements in some order, written in parentheses: (2, 4, 6, 8, …)

A finite sequence of k elements is called a k-tuple.
(2, 4) is a 2-tuple or pair
(2, 4, 6) is a 3-tuple

A Cartesian product of S and P (S x P) is the set of 2-tuples/pairs (i, j) where i ∈ S and j ∈ P.

Page 32: Cs419 lec3   lexical analysis using re

EXAMPLE FOR CARTESIAN PRODUCT

Example (1)
If N = {1, 2, …} set of integers; O = {+, -}
N x O = {(1,+), (1,-), (2,+), (2,-), …..}
Meaningless?

Example (2)
N x O x N = {(1,+,1), (1,-,1), (2,+,1), (2,-,2), …..}
Does this make sense?

Page 33: Cs419 lec3   lexical analysis using re

CONTINUE CARTESIAN PRODUCT

Example (3)
If A = {a, b, …, z} set of English letters;
A x A = {(a, a), (a, b), …, (d, g), (d, h), …, (z, z)}
These are all pairs of elements of A.

Example (4)
If U = {0, 1, 2, 3, …, 9} set of digits, then
U x U x U = {(0,0,0), (0,0,1), …, (7,4,1), …, (9,9,9)}

Page 34: Cs419 lec3   lexical analysis using re

CONTINUE CARTESIAN PRODUCT

Example (5)
If U = {0, 1, 2, 3, …, 9} set of digits, then
U x … x U (n times) = {(0,..,0), (0,..,1), …, (7,..,1), …, (9,..,9)}
Can be written as U^n
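A Cartesian power such as U^n can be built with itertools.product, as in this sketch. The function name cartesian_power is an illustrative choice; tuples are produced in lexicographic order, starting from the all-zero tuple because 0 is an element of U.

```python
from itertools import product

U = range(10)  # the digits 0..9

def cartesian_power(base, n):
    """All n-tuples whose components come from base."""
    return list(product(base, repeat=n))

triples = cartesian_power(U, 3)
print(len(triples))              # 10 * 10 * 10 tuples
print(triples[0], triples[-1])   # first and last tuple in the listing
```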

Page 35: Cs419 lec3   lexical analysis using re

RELATIONS AND FUNCTIONS

A relation is more general than a function.

Both map a set of elements called the domain to another set called the co-domain.

In the case of functions, the co-domain can be called the range.

R : D → C   A relation has no restrictions.
f : D → R   A function cannot map one element to two different elements in the range.
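The restriction on functions can be expressed as a small check over a relation given as a set of pairs. This is a sketch; the name is_function and the sample pairs are illustrative.

```python
def is_function(relation):
    """True if no domain element is paired with two different range elements."""
    image = {}
    for x, y in relation:
        if x in image and image[x] != y:
            return False  # x is mapped to two different elements
        image[x] = y
    return True

print(is_function({(1, "a"), (2, "a")}))  # two inputs may share an output
print(is_function({(1, "a"), (1, "b")}))  # one input with two outputs
```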

Page 36: Cs419 lec3   lexical analysis using re

SURJECTIVE FUNCTION

[Figure: mapping between elements P1, P2, P3, P4, …, Pn and elements t1, t2, t3 of T]

Bijective function

[Figure: mapping between elements s1, s2, s3 of S and elements t1, t2, t3 of T]

Many planes fly at the same time.

Only one plane lands on one runway at a time.

Page 37: Cs419 lec3   lexical analysis using re

WHAT IS THE USE OF FUNCTIONS IN CS?

Functions help to describe the transition function that transfers the computing device from one state to another.

Any computing device must have clear states.

[Figure: transition function from a domain D (states s1, s2, s3) to a range R (states s2, s_halt, s3)]
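A transition function can be sketched as a Python dictionary from the current state to the next state. The particular transitions below are hypothetical, chosen only to reuse the state names s1, s2, s3 and s_halt; they are not read off the figure.

```python
# Hypothetical transition table: each state maps to exactly one next state.
TRANSITIONS = {"s1": "s2", "s2": "s3", "s3": "s_halt"}

def run(start, max_steps=100):
    """Follow the transition function from start until s_halt; return visited states."""
    state = start
    visited = [state]
    for _ in range(max_steps):
        if state == "s_halt":
            break
        state = TRANSITIONS[state]
        visited.append(state)
    return visited

print(run("s1"))
```

Because a dictionary holds one value per key, it cannot map a state to two different next states, which is exactly the restriction that makes the transition a function.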

Page 38: Cs419 lec3   lexical analysis using re

GRAPHS

A graph is a visual representation of a set and a relation of this set over itself.

G = (V, E)
V = {1, 2, 3, 4, 5}
E = {(i, j) and (j, i) | i, j belong to V}
E is a set of pairs = {(1, 3), (3, 1), …, (5, 4), (4, 5)}

[Figure: graph with vertices 1, 2, 3, 4, 5]

Page 39: Cs419 lec3   lexical analysis using re

GRAPHS CONSTRUCT

Is there a formal language to describe a graph?

G = (V, E)
Where:
V is a set of n vertices = {i | 1 ≤ i ≤ n}
E is a set of edges. Each edge is a pair of elements in V
  = {(i, j), (j, i) | i, j ∈ V}
  or = {(i, j) | i, j ∈ V}

[Figure: graph with vertices 1, 2, 3, 4, 5]

Page 40: Cs419 lec3   lexical analysis using re

DEFINITIONS

Alphabet (Σ): a finite set of letters, denoted Σ
Letter: an element of an alphabet Σ
Word: a finite sequence of letters from the alphabet Σ
Λ (empty string): a word without letters
Language Σ* (Kleene's star): the set of all words on Σ

Page 41: Cs419 lec3   lexical analysis using re

STRINGS AND LANGUAGES

A string w1 over an alphabet Σ is a finite sequence of symbols from that alphabet.

1. Σ is a set of symbols, e.g.
   {a, b, c, …, z}       English letters
   {0, 1, 2, …, 9, .}    digits of Arabic numbers
   {AM, PM}              different clocking system
   {1, 2, …, 12}         hours of a clock

Page 42: Cs419 lec3   lexical analysis using re

STRINGS - 2

2.1 String: a sequence of Σ (sigma) symbols

Σ                     Example   Description                  Σ to the power?
{a, b, c, …, z}       apple     English string               Σ^5
{0, 1, 2, …, 9, .}    35        the oldest age for girls     Σ^2
{AM, PM}              PM        clocking system              Σ^1
{1, 2, …, 12}         12        a specific hour in the day   Σ^1 or Σ^2

Page 43: Cs419 lec3   lexical analysis using re

STRINGS - 3

2.2 The empty string Λ (lambda) is of length zero.
    Σ^0 = {Λ}

2.3 Reverse(xyz) = zyx

2.4 A palindrome is a string whose reverse is identical to itself.
    If Σ = {a, b} then PALINDROME = {Λ and all strings x such that reverse(x) = x}
    English examples: radar, level, reviver, racecar, madam, pop and noon.
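Reverse and the palindrome test translate directly into Python. This is a minimal sketch; the empty Python string stands for Λ.

```python
def reverse(word):
    """Return the letters of word in opposite order."""
    return word[::-1]

def is_palindrome(word):
    """True when the word reads the same in both directions."""
    return reverse(word) == word

examples = ["radar", "level", "reviver", "racecar", "madam", "pop", "noon"]
print(reverse("xyz"), all(is_palindrome(word) for word in examples))
```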

Page 44: Cs419 lec3   lexical analysis using re

STRINGS - 4

2.5 Kleene star * or Kleene closure is similar to the cross product of a set/string over itself.

If Σ = {x}, then Σ* = {Λ x xx xxx ….}
If Σ = {x}, then Σ^+ = {x xx xxx ….}

If S = {w1, w2, w3} then
S* = {Λ, w1, w2, w3, w1w1, w1w2, w1w3, ….}
S^+ = {w1, w2, w3, w1w1, w1w2, w1w3, ….}

Note 1: if w3 = Λ, then Λ ∈ S^+
Note 2: S* = S**

Page 45: Cs419 lec3   lexical analysis using re

PART THREE: EXERCISES

Page 46: Cs419 lec3   lexical analysis using re

EXERCISE 1

Assume Σ = {0, 1}

1. How many elements are there in Σ^2?
   Length(Σ) x Length(Σ) = 2 x 2 = 4
2. How many combinations are there in Σ^3?
   2 x 2 x 2 = 2^3 = 8
3. How many elements are there in Σ^n?
   (Length(Σ))^n = 2^n

Page 47: Cs419 lec3   lexical analysis using re

EXERCISE 2

Assume A1 = {AM, PM}, A2 = {1, 2, …, 59}, A3 = {1, 2, …, 12}

1. How many elements are there in A1 x A3?
   Length(A1) x Length(A3) = 2 x 12 = 24
2. How many elements are there in A1 x A2 x A3?
   Length(A1) x Length(A2) x Length(A3) = 2 x 59 x 12 = 1416

Page 48: Cs419 lec3   lexical analysis using re

EXERCISE 3

Assume L1 = {Add, Subtract}, DecimalDigits = {0, 1, 2, …, 9}

1. Construct integer numbers out of DecimalDigits.
   DecimalDigits^+
   A number of any length of digits.
2. Construct a language for assembly commands from L1 and DecimalDigits.
   L1 DecimalDigits^+ {,} DecimalDigits^+
   Commands in the form of Add 1000, 555
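The command language from part 2 can be sketched as a Python regular expression. Allowing optional whitespace around the operands is an assumption made here so that the sample "Add 1000, 555" matches; the formal expression itself does not mention spaces, and the function name parse_command is illustrative.

```python
import re

# L1 DecimalDigits^+ {,} DecimalDigits^+ with optional whitespace (an assumption).
COMMAND = re.compile(r"(Add|Subtract)\s*(\d+)\s*,\s*(\d+)")

def parse_command(text):
    """Return (operation, first, second) for a valid command, else None."""
    match = COMMAND.fullmatch(text)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))

print(parse_command("Add 1000, 555"))
print(parse_command("Multiply 1, 2"))
```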

Page 49: Cs419 lec3   lexical analysis using re

EXERCISE 4

Assume HexaDecimal = {0, 1, …, 9, A, B, C, D, E, F}

1. How many HexaDecimals of length 4?
   16^4
2. How many HexaDecimals of length n?
   16^n
3. How many elements are there in {0, 1}^8?
   2^8 = 256

Page 50: Cs419 lec3   lexical analysis using re

THANK YOU