Copyright © 2008-2015 Curt Hill Regular Expressions Providing a Search Pattern.

Post on 17-Jan-2016

Copyright © 2008-2015 Curt Hill

Regular Expressions

Providing a Search Pattern

What are they?
• A regular expression is a special text pattern for describing a search
• This text pattern allows special sequences to have special meaning
• Any other character just matches itself in the searched string

Specials
• The special characters include
– [ ] \ ^ * $ . ? + ( ) { } |
– The braces may be literal or special depending on their usage
• Any other character just matches itself
• Thus Hello as a pattern just matches the string Hello
• Since many of these characters are also useful in strings, the escape is used to match them literally

Escape
• The backslash character is the escape
• Thus to look for an asterisk (a special) in a string it must be escaped: \*
– This allows a search to find the asterisk
• Some escape sequences are the same as in the C family:
– \n newline or linefeed
– \t tab
– \r carriage return
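A minimal sketch of the escape in JavaScript; the variable names and sample strings are illustrative only.

```javascript
// An escaped asterisk matches a literal asterisk.
const literalStar = /\*/;
// \t is a tab, just as in C; \w+ is one or more word characters.
const tabThenWord = /\t\w+/;

// search returns the position of the first match.
const starIndex = "2 * 3".search(literalStar);
// test returns true when the pattern matches somewhere in the string.
const hasTab = tabThenWord.test("name\tvalue");

console.log(starIndex, hasTab);
```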

Coded escapes
• An x and two hexadecimal digits may also follow the backslash
• Thus \x4E gives the ASCII character with hexadecimal value 4E (an N in ASCII)
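As a quick check of the coded escape, the sketch below assumes nothing beyond standard JavaScript; the names are illustrative.

```javascript
// \x4E inside a pattern stands for the character with code 0x4E.
const hexN = /\x4E/;

const matchesUpper = hexN.test("Name");   // contains an upper case N
const matchesLower = hexN.test("name");   // only a lower case n

// The same escape works inside an ordinary string.
const sameAsLetter = "\x4E" === "N";

console.log(matchesUpper, matchesLower, sameAsLetter);
```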

Positioning
• There are two specials that force a position
• ^ matches the beginning of the line
• $ matches the end of the line
• Both of these match a position rather than a character
• Without these a pattern could match anywhere within a string

Positioning examples
• The pattern ^Hi will match any line that starts with the two characters H and i
• The pattern ,$ will match any line that ends with a comma
• The pattern ^Hello$ will match only a line that has Hello as its only content
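The three patterns above can be tried in JavaScript as follows. Each sample string is a single line, and the variable names are illustrative.

```javascript
const startsWithHi = /^Hi/;
const endsWithComma = /,$/;
const onlyHello = /^Hello$/;

const results = [
  startsWithHi.test("Hi there"),     // Hi is at the start
  startsWithHi.test("Oh, Hi"),       // Hi is not at the start
  endsWithComma.test("first,"),      // comma is the last character
  onlyHello.test("Hello"),           // nothing but Hello
  onlyHello.test("Hello world"),     // extra content after Hello
];

console.log(results);
```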

Wildcards
• The dot will match any one character
– Except end of line control characters
• Thus A.B could match ABB, ACB, A.B or any other three character sequence starting with A and ending with B
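A small sketch of the dot wildcard in JavaScript, with illustrative names and sample strings:

```javascript
const aAnyB = /A.B/;

const matchesLetter = aAnyB.test("ACB");    // any character may sit in the middle
const matchesDot = aAnyB.test("A.B");       // including a literal dot
const needsMiddle = aAnyB.test("AB");       // the dot must consume one character
const skipsNewline = aAnyB.test("A\nB");    // but not an end of line character

console.log(matchesLetter, matchesDot, needsMiddle, skipsNewline);
```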

Repetition
• It is often desirable to repeat a pattern a fixed number of times
• This is done by following the pattern with a set of braces with an integer inside
• Thus abbbc is the same as ab{3}c

Repetition
• There are three repetition characters which are more general
• Closure is the *
– It represents zero or more repetitions of the previous item
• The + represents one or more repetitions of the previous item
• The ? represents zero or one occurrence of the previous item

Examples
• ~* matches any number (including zero) of successive tildes
• \-* matches zero or more dashes
• .+ matches one or more of any character
• hats? matches either hat or hats
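The repetition forms can be checked with a short JavaScript sketch. The ^ and $ anchors are added here (an assumption, not part of the slide patterns) so that each test covers the whole string.

```javascript
// Each entry pairs a pattern with a sample string.
const checks = {
  fixedCount: /^ab{3}c$/.test("abbbc"),     // exactly three b characters
  tooFew: /^ab{3}c$/.test("abbc"),          // only two b characters
  zeroTildes: /^~*$/.test(""),              // * accepts zero repetitions
  manyTildes: /^~*$/.test("~~~"),
  plusNeedsOne: /^.+$/.test(""),            // + needs at least one character
  hat: /^hats?$/.test("hat"),               // ? makes the s optional
  hats: /^hats?$/.test("hats"),
  hatss: /^hats?$/.test("hatss"),           // but allows at most one s
};

console.log(checks);
```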

Grouping
• So far the repetitions could only be applied to a single character
• What is needed next is some type of grouping
• This is provided by parentheses
• Enclosing a pattern in parentheses makes it a group
• This group can then be followed by a repetition character

Examples
• (\*-)* will match
– *-
– *-*-
– *-*-*- etc
• The asterisk inside the group is escaped so that it matches a literal asterisk
• The * is greedy: it will try to match as many of these as possible
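A hedged sketch of the grouped repetition in JavaScript. It uses the exec method of the RegExp object, which returns the matched text in element 0 of an array; the names and sample strings are illustrative.

```javascript
// The group (\*-) is repeated zero or more times.
const starDash = /(\*-)*/;

// Greedy: the match takes all three repetitions, not just the first.
const greedy = starDash.exec("*-*-*-end");
const matchedText = greedy[0];

// Zero repetitions are also allowed, so the match can be empty.
const emptyText = starDash.exec("end")[0];

console.log(JSON.stringify(matchedText), JSON.stringify(emptyText));
```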

More interesting patterns
• A number is pretty easy to understand from our perspective but not so easy to describe
– Except in regular expressions
• An integer is a string of digits
– Possibly preceded by a plus or minus
• So how is this done?
• With sets and repetition

A set
• A pair of brackets may be filled with characters
• This will match any one of them
• Thus the digits could be done with: [0123456789]
• An integer could then be: [-+]?[0123456789]+
• Any single vowel is: [aeiouAEIOU]
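The integer pattern can be sketched in JavaScript as below. The helper name isInteger is hypothetical, and the ^ and $ anchors are an added assumption so that the whole string must be an integer.

```javascript
// Optional sign, then one or more digits, and nothing else.
const integer = /^[-+]?[0123456789]+$/;

// Hypothetical helper: true when text is a signed or unsigned integer.
function isInteger(text) {
  return integer.test(text);
}

// A set matches exactly one of the listed characters.
const vowel = /[aeiouAEIOU]/;
const firstVowel = "sky blue".search(vowel);   // position of the u
const noVowel = "rhythm".search(vowel);        // -1, no vowel present

console.log(isInteger("-42"), isInteger("4.2"), firstVowel, noVowel);
```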

Ranges in sets
• Listing every digit or letter is more than we want to type
• The range is handled by a dash: [0-9] is the same as [0123456789]
• The letters are then: [a-zA-Z]
• If you want a dash in a set place it first

Complement or Negation
• You may place a caret ^ at the beginning of a set to ask for any character but those present
• Thus [^0-9] is any character but a digit
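Ranges, a leading dash and a negated set can be tried together in a short JavaScript sketch. The names and sample strings are illustrative, and the anchors are added so each test covers the whole string.

```javascript
// A negated set: the first character that is not a digit.
const nonDigit = /[^0-9]/;
const firstNonDigit = "2024a".search(nonDigit);

// A dash placed first in the set is a literal dash, not a range.
const dashOrDigit = /^[-0-9]+$/;
const phoneLike = dashOrDigit.test("555-1212");

// Two ranges in one set cover all the ASCII letters.
const letters = /^[a-zA-Z]+$/;
const allLetters = letters.test("Hello");
const withDigit = letters.test("Hello1");

console.log(firstNonDigit, phoneLike, allLetters, withDigit);
```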

Shortcut sets
• Several classes are so commonly used that a shortcut exists
• This is an escaped character
• \d is a digit [0-9]
• \D is not a digit [^0-9]
• \w is an alphanumeric [a-zA-Z0-9_]
• \W is not an alphanumeric [^a-zA-Z0-9_]
• \s is whitespace [ \r\n\t\f\v]
– \f is formfeed, \v is vertical tab
• \S is not whitespace [^ \r\n\t\f\v]
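A brief JavaScript sketch of the shortcut sets; the sample strings are illustrative.

```javascript
const shortcuts = {
  threeDigits: /^\d\d\d$/.test("239"),         // \d is any digit
  anyNonDigit: /\D/.test("123"),               // \D finds nothing in all digits
  wordChars: /^\w+$/.test("snake_case9"),      // letters, digits and underscore
  anyNonWord: /\W/.test("snake_case9"),        // so \W finds nothing here
  spaceAt: "ab cd".search(/\s/),               // position of the first whitespace
  noSpace: /\s/.test("no_space"),
};

console.log(shortcuts);
```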

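Specials
• In some sense the right parenthesis, right bracket and dash are ambiguous as specials
• In certain contexts they are regular characters and in others they are specials
• The right bracket is only special if there is a leading left bracket
• A right parenthesis closes a group opened by a leading left; JavaScript rejects an unmatched one, so escape it to match it literally
• Dash is only special in a set and following another character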

Alternation
• A set provides intuitive alternation
• The match process may choose any character within the set to use
• This alternation only applies to a number of single characters
• There is also an alternation character
– The vertical bar |
• This allows either simple or complicated patterns to alternate

Alternation
• Thus: A|E|I|O|U is equivalent to [AEIOU]
• However, more interesting alternations are possible and useful
– (abc)|(123) will match either of the two strings
– ([-+]?\d+)|(\w+) will match any string of characters that looks like a number or word
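The alternation examples can be sketched in JavaScript. An outer group and the ^ and $ anchors are added here as an assumption, so that each alternative must cover the whole string; the names are illustrative.

```javascript
// Alternation of single characters behaves like a set.
const vowelAlt = /^(A|E|I|O|U)$/;
const vowelSet = /^[AEIOU]$/;
const sameAnswers = "AEIOUXY".split("").every(
  (ch) => vowelAlt.test(ch) === vowelSet.test(ch)
);

// Alternation of longer patterns.
const either = /^((abc)|(123))$/;
const numberOrWord = /^(([-+]?\d+)|(\w+))$/;

const alternation = {
  abc: either.test("abc"),
  digits: either.test("123"),
  both: either.test("abc123"),          // neither alternative covers this
  number: numberOrWord.test("-42"),
  word: numberOrWord.test("hello"),
  neither: numberOrWord.test("4.2"),    // the dot fits neither alternative
};

console.log(sameAnswers, alternation);
```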

How to use in JavaScript?
• There are two ways that deserve some attention
• Strings have search and replace methods
– Easiest
– Will deal with these first
• The RegExp object
– Most versatile and most complicated

String search
• The search method takes a RegExp pattern and returns an integer position
• The result is the index if found and -1 if the pattern has not been found
• If the pattern is a string it is cast into a RegExp
– You cannot always use the other features of the RegExp object
– It is a powerful feature anyhow
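A minimal sketch of search with both kinds of argument; the sample strings are illustrative.

```javascript
const text = "one two three";

const viaRegExp = text.search(/two/);    // index of the match
const viaString = text.search("two");    // the string is converted to a RegExp
const notFound = text.search(/four/);    // -1 when nothing matches

// Because a string argument becomes a pattern, a dot in it is a wildcard.
const dotIsWildcard = "a.b".search(".");

console.log(viaRegExp, viaString, notFound, dotIsWildcard);
```

The last line is worth noticing: the dot matches the a at position 0 rather than the literal dot at position 1.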

One little glitch
• Since the escape is the \ for both strings and regular expressions we have a little problem
• To code the pattern \. for a literal dot, we would have to code: "\\."
• Since this is awkward we do something else

Regular Expression Pattern
• JavaScript has an alternative form for regular expression patterns
• Instead of enclosing the pattern in quotes, where the string escape sequences must be dealt with, it uses the forward slash as the delimiter
• Thus: /\./ is a valid regular expression pattern equivalent to "\\."
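The doubling problem can be seen directly with search. This is only a sketch; the file name is an illustrative sample.

```javascript
const name = "file.txt";

// In a string the backslash must be doubled to reach the pattern.
const viaString = name.search("\\.");

// The slash notation needs only one backslash.
const viaLiteral = name.search(/\./);

// With a single backslash the string is just ".", which is the wildcard.
const unescaped = name.search("\.");

console.log(viaString, viaLiteral, unescaped);
```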

Slash Notation
• This notation looks funny but avoids the doubling of the escape character
• It may be assigned to variables:
    var s = /\$\d*/
• Doing so makes s a RegExp object

Pattern Modifiers
• There are several pattern modifiers
– Lower case letters that follow the slash pattern notation
• An i means ignore case on the whole pattern
– /[A-Z]*/i will match any string of letters in either case
• Others are possible as well
– m and g
• These are also known as flags
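A short sketch of the i and m modifiers in JavaScript. The anchors are added so the whole string is tested, and the sample strings are illustrative.

```javascript
// i: ignore case across the whole pattern.
const strictCase = /^[A-Z]*$/.test("Hello");     // lower case letters fail
const ignoreCase = /^[A-Z]*$/i.test("Hello");    // any case is accepted

// m: ^ and $ also match at the start and end of each line.
const twoLines = "first\nsecond";
const withoutM = /^second/.test(twoLines);       // ^ is only the start of the string
const withM = /^second/m.test(twoLines);         // ^ also follows the line break

console.log(strictCase, ignoreCase, withoutM, withM);
```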

Search example
• Consider
    var s = "2314 Misc $23.85 in stock";
    // A pattern for money
    var numpat = /\$\d*\.\d*/;
    var pos = s.search(numpat);
    document.write("<P>position is ", pos);
• The result displayed is 10
– Positions are counted from zero, and the dollar sign is the eleventh character

String Replace
• A search is not the only thing available
– There is also a replace
• Takes two parameters
– The search pattern
– The replacement string
• Returns the new string
• Only the first match will be replaced, unless the pattern has the g modifier

Example:
• This code
    s = "Welcome to VCSU. VCSU is cool.";
    t = s.replace(/VCSU/, "Valley City State");
    document.write("<P> ", t);
• Will provide the following output
    Welcome to Valley City State. VCSU is cool.
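For comparison, the sketch below runs the same replacement with and without the g modifier. The variable names firstOnly and everyMatch are illustrative.

```javascript
const s = "Welcome to VCSU. VCSU is cool.";

// Without g only the first match is replaced.
const firstOnly = s.replace(/VCSU/, "Valley City State");

// With g every match is replaced.
const everyMatch = s.replace(/VCSU/g, "Valley City State");

// replace returns a new string; the original is unchanged.
console.log(firstOnly);
console.log(everyMatch);
console.log(s);
```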

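Match
• The match method is somewhat more complicated and will not be considered seriously here
• It is similar to search
• It returns null if the pattern is not found
• Without the g modifier it returns an array describing the first match, with an index property giving its position
• With the g modifier it returns an array containing all of the matched strings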
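A minimal sketch of what match returns in each case; the sample string and names are illustrative.

```javascript
const str = "the answers are 239 and 512";

// Without g: an array for the first match, plus an index property.
const one = str.match(/[0-9]+/);
const firstText = one[0];
const firstIndex = one.index;

// With g: an array of every matched string.
const all = str.match(/[0-9]+/g);

// No match at all gives null.
const none = str.match(/xyz/);

console.log(firstText, firstIndex, all, none);
```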

RegExp object
• Clearly there is more that could be learned from the pattern match
• We would like to know
– What actual string was matched
– What was the last position of the matched string
– Among many others
• This will also help us to modify how things are done

Constructor
• Just assigning a pattern to a variable does construction:
    re = /\d*/i;
• You may also use a regular constructor
– The first parameter is the pattern, given as a string
– The second the modifiers
– re = new RegExp("Hello", "i")
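The two forms of construction can be compared in a short sketch. The names are illustrative; note that a backslash in the string form has to be doubled.

```javascript
// Slash notation and the constructor build equivalent objects.
const literal = /hello/i;
const built = new RegExp("hello", "i");

const samePattern = literal.source === built.source;
const sameFlag = literal.ignoreCase === built.ignoreCase;
const matchesUpper = built.test("HELLO");

// In the string form the escape must be doubled: "\\d+" is the pattern \d+.
const digits = new RegExp("\\d+");
const hasDigits = digits.test("abc 42");

console.log(samePattern, sameFlag, matchesUpper, hasDigits);
```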

exec Method
• The exec method returns the characters that matched
• The parameter is the string to search
• Example:
    re = /[0-9]+/;
    s = re.exec("answers 239 and 512");
• Returns an array whose first element, s[0], is the string 239
– The index property, s.index, gives the position of the match
• Does the search thing but produces the matched text instead of a number
– Returns null for failure
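A small sketch of exec on the same sample string; the variable names are illustrative.

```javascript
const re = /[0-9]+/;

// On success exec returns an array; element 0 is the matched text.
const result = re.exec("answers 239 and 512");
const matched = result[0];
const position = result.index;

// On failure exec returns null, so check before using the result.
const failure = re.exec("no digits here");

console.log(matched, position, failure);
```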

Global searching
• You may set the global searching modifier with the g suffix
• Each search will set the lastIndex property to where the matched text ended
– The first location not matched
• A subsequent search will start at this location
• If the object does not have global set, the lastIndex will not be changed

Example
• Consider:
    re = /[0-9]+/g;
    str = "the answers are 239 and 512";
    s = re.exec(str);
    t = re.exec(str);
• The s[0] will hold the 239 and the t[0] the 512
• More serious manipulations could use lastIndex to do more complicated things
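One way lastIndex might be used is to collect every match together with its start and end positions. This is a sketch only; the found array and its field names are illustrative.

```javascript
const re = /[0-9]+/g;
const str = "the answers are 239 and 512";

// Each call to exec resumes at re.lastIndex and returns null when no match is left.
const found = [];
let m;
while ((m = re.exec(str)) !== null) {
  found.push({ text: m[0], start: m.index, end: re.lastIndex });
}

// After the failed final search lastIndex is reset to 0.
const finalIndex = re.lastIndex;

console.log(found, finalIndex);
```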
