1 Python & Pattern Matching with Regular Expressions (REs) OPIM 101 File:PythonREs.ppt.


1

Python & Pattern Matching with Regular Expressions (REs)

OPIM 101

File:PythonREs.ppt

2

Foresight

• Pattern matching

– Literal

– With metacharacters

• Regular expressions (REs)

• Using REs in Python

3

Consider: dir by Itself

D:\athomepc\day\idt>dir

 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt

.              <DIR>        01-01-02  8:16a .
..             <DIR>        01-01-02  8:16a ..
SPRING~1 PDF       180,072  01-01-02  8:17a spring02idtfront.pdf
SPRING~2 PDF       241,542  01-01-02  8:19a spring02idtpartI.pdf
SPRING~3 PDF     1,246,514  01-01-02  8:20a spring02idtpartII.pdf
SPRING~4 PDF     2,517,343  01-01-02  8:22a spring02idtpartIII.pdf
SPRING~5 PDF     3,469,138  01-01-02  8:24a spring02idtpartIV.pdf
CASE1-~1 DOC        35,328  01-01-02  8:42a case1-python.doc
LECTUR~1 PPT        78,336  01-01-02  9:45a lecture01fall01.ppt
PYTHON~1 PPT        34,816  01-01-02  9:46a Python_Intro.ppt
PYTHON~2 PPT        37,376  01-01-02  9:46a Python_Structures.ppt
LECTUR~2 PPT       154,112  01-01-02 11:51a lecture01spring02.ppt
PYTHON~3 PPT        34,816  01-01-02 11:52a PythonREs.ppt
        11 file(s)      8,029,393 bytes
         2 dir(s)        1,209.06 MB free

D:\athomepc\day\idt>

4

Now: dir with a Literal Search

D:\athomepc\day\idt>dir case1-python.doc

 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt

CASE1-~1 DOC        35,328  01-01-02  8:42a case1-python.doc
         1 file(s)         35,328 bytes
         0 dir(s)        1,209.06 MB free

D:\athomepc\day\idt>

5

Now: dir with “*”

D:\athomepc\day\idt>dir *.doc

 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt

CASE1-~1 DOC        35,328  01-01-02  8:42a case1-python.doc
         1 file(s)         35,328 bytes
         0 dir(s)        1,209.06 MB free

D:\athomepc\day\idt>

6

Literal vs. Pattern Searches

• dir myfile.doc

– Searches literally, for an exact match with “myfile.doc”

• dir my*.doc

– Does a pattern search. Matches any file name beginning with “my”, followed by 0 or more characters of any kind, followed by “.doc”
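The same literal-versus-pattern distinction can be tried out in Python. This is only a sketch: it uses the standard library's fnmatch module, whose shell-style wildcards resemble dir's "*" but are not guaranteed to follow exactly the same rules, and the file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, chosen only for illustration.
names = ["myfile.doc", "my.doc", "mynotes.doc", "yourfile.doc", "myfile.txt"]

# Literal search: only an exact match counts.
literal_hits = [n for n in names if n == "myfile.doc"]

# Pattern search: "*" stands for zero or more characters of any kind.
pattern_hits = [n for n in names if fnmatchcase(n, "my*.doc")]

print(literal_hits)   # ['myfile.doc']
print(pattern_hits)   # ['myfile.doc', 'my.doc', 'mynotes.doc']
```

Note that "my.doc" matches the pattern, because "*" is allowed to match zero characters.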

7

MetaCharacters

• dir treats “*” as a metacharacter: a character that is not taken literally, but as an instruction to match a certain kind of pattern (here: anything)

• The dir metacharacter scheme is very useful

8

On Beyond *

• ...and also very primitive and limited

• A step up: grep in Unix & Linux; support for RE searches in some text editors, e.g., TextPad (www.textpad.com)

• Regular expressions (REs) use a richer language and larger set of metacharacters, giving us a very powerful capability to extract information (patterns) from text

9

Python’s RE Metacharacters

• Here’s the complete list:

. ^ $ * + ? { } [ ] \ | ( )

• No use memorizing. We’ll learn by examples.

• A natural question: But what if I want to search for a pattern that contains what Python’s RE counts as metacharacters?

– Be just a little patient

10

Load Python’s re Module

>>> import re
>>> teststring = "Television is public anomie number 1."
>>> teststring
'Television is public anomie number 1.'
>>> len(teststring)
37
>>> match = re.search('anomie', teststring)
>>> match is None
False
>>> match.span()
(21, 27)
>>> teststring[21:27]
'anomie'

11

Now a Nonliteral Match

>>> match = re.search('Television', teststring)
>>> match is None
False
>>> match = re.search('television', teststring)
>>> match is None
True
>>> match = re.search('[tT]elevision', teststring)
>>> match.span()
(0, 10)
>>> teststring
'Television is public anomie number 1.'

12

Square Bracket Notation: [...]

• “[tT]” means “any one of the characters ‘t’ or ‘T’.”

• [...] is called a character class

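• Examples:

– [abc], [a-z], [A-Z]

– [^t^T] not t and not T (only the first ^ negates; the second is taken literally, so this class also excludes ‘^’. [^tT] is the usual way to write “not t and not T”)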
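A short sketch of character classes at work, reusing the teststring from the earlier slides. re.findall is used here instead of re.search only because it lists every match at once; the variable names are illustrative.

```python
import re

teststring = "Television is public anomie number 1."

# [tT] matches one character: either "t" or "T".
t_hits = re.findall(r"[tT]", teststring)

# [A-Z] matches any one uppercase ASCII letter.
upper_hits = re.findall(r"[A-Z]", teststring)

# [^a-z ] matches any one character that is neither a lowercase
# letter nor a space (a leading ^ negates the class).
other_hits = re.findall(r"[^a-z ]", teststring)

print(t_hits)      # ['T']
print(upper_hits)  # ['T']
print(other_hits)  # ['T', '1', '.']
```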

13

Not Example ^

>>> teststring
'Television is public anomie number 1.'
>>> match = re.search('[^t^T][a-z]+', teststring)
>>> match.span()
(1, 10)
>>> teststring[1:10]
'elevision'

Note: + means “one or more of the previous”

* means “zero or more” of the previous

? means “zero or one” of the previous
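To see the three quantifiers side by side, here is a minimal sketch that applies each one to the letter "a" between "c" and "t". The test words are made up, and re.fullmatch (available in Python 3.4 and later) is used so that the whole word must fit the pattern.

```python
import re

results = {}
for text in ["ct", "cat", "caat"]:
    plus = bool(re.fullmatch(r"ca+t", text))      # one or more "a"
    star = bool(re.fullmatch(r"ca*t", text))      # zero or more "a"
    optional = bool(re.fullmatch(r"ca?t", text))  # zero or one "a"
    results[text] = (plus, star, optional)
    print(text, plus, star, optional)

# Prints:
# ct False True True
# cat True True True
# caat True True False
```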

14

'\s\w+\.' and '\s(\w+)\.'

>>> teststring
'Television is public anomie number 1.'
>>> match = re.search(r'\s\w+\.', teststring)
>>> match.span()
(34, 37)
>>> teststring[34:37]
' 1.'
>>> match = re.search(r'\s(\w+)\.', teststring)
>>> match.span(0)
(34, 37)
>>> match.span(1)
(35, 36)
>>> teststring[35:36]
'1'
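In a script, the parenthesized group is usually read back with match.group() rather than by slicing the string by hand. A minimal sketch, assuming the same teststring; the names whole and inner are illustrative. Here \s is a whitespace character, \w is a letter, digit, or underscore, and \. is a literal period.

```python
import re

teststring = "Television is public anomie number 1."

# The pattern is a raw string (r'...') so that Python passes the
# backslashes through to the re module unchanged.
match = re.search(r"\s(\w+)\.", teststring)

if match is not None:
    whole = match.group(0)   # the entire match, same as match.group()
    inner = match.group(1)   # only what the parentheses captured
    print(repr(whole), match.span(0))   # ' 1.' (34, 37)
    print(repr(inner), match.span(1))   # '1' (35, 36)
```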

15

[.] == \.

• Inside [...] most metacharacters are taken literally

– So, [.] == \.

• Note (again): [...] is called a character class

>>> match = re.search(r'\s(\w+)[.]', teststring)
>>> match.span()
(34, 37)
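This also answers the earlier question about searching for text that contains metacharacters. Besides writing \. or [.] by hand, the re module provides re.escape, which backslash-escapes the special characters in a string. A minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text containing the metacharacters "$" and ".".
price_text = "Total: $3.50 (approx.)"
literal = "$3.50"

# Unescaped, "$" means "end of string", so this pattern cannot match.
unescaped = re.search(literal, price_text)

# re.escape() puts a backslash before each special character,
# so the string is searched for literally.
escaped = re.search(re.escape(literal), price_text)

print(unescaped)        # None
print(escaped.group())  # $3.50
print(escaped.span())   # (7, 12)
```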

16

Avoiding Greed ?

>>> newstring = '<div align="center">'
>>> newstring = newstring+'<i class="smaller">'
>>> newstring = newstring+'(As of 10:55 AM on 12/20/01)'
>>> newstring = newstring+'</i></div><br>'
>>> newstring
'<div align="center"><i class="smaller">(As of 10:55 AM on 12/20/01)</i></div><br>'
>>> match = re.search('<.+>', newstring)
>>> match.span()
(0, 81)
>>> match = re.search('<.+?>', newstring)
>>> match.group()
'<div align="center">'
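The difference is easier to see when every match is listed. The sketch below reuses newstring and compares the greedy pattern, the non-greedy pattern, and a negated character class. The negated class is an alternative not shown on the slides; it happens to give the same result for this string.

```python
import re

newstring = ('<div align="center"><i class="smaller">'
             '(As of 10:55 AM on 12/20/01)</i></div><br>')

# Greedy: .+ grabs as much as it can, so one match spans the whole string.
greedy = re.findall(r"<.+>", newstring)

# Non-greedy: .+? stops at the first ">" that lets the match succeed.
lazy = re.findall(r"<.+?>", newstring)

# Alternative: "anything except >" can never run past the closing bracket.
negated = re.findall(r"<[^>]+>", newstring)

print(len(greedy))  # 1
print(lazy)
# ['<div align="center">', '<i class="smaller">', '</i>', '</div>', '<br>']
print(negated == lazy)  # True
```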

17

More on Not Being Greedy

>>> match = re.search(r'<(\w).+?>(.+)</(\1)', newstring)
>>> match.groups()
('d', '<i class="smaller">(As of 10:55 AM on 12/20/01)</i>', 'd')
>>> match = re.search(r'<(\w).+?>([^<]+)</(\1)', newstring)
>>> match.groups()
('i', '(As of 10:55 AM on 12/20/01)', 'i')

\1 is called a backreference. It matches the same text that group 1 matched.
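A backreference is also a handy way to spot repeated text. The sketch below looks for a word immediately followed by itself; the sample sentence is made up, and \b (a word boundary, not covered on the slides) keeps the pattern from matching inside longer words.

```python
import re

# Hypothetical sentence with two accidental repetitions.
text = "This is is a test of the the backreference."

# (\w+) captures a word as group 1; \1 then requires the very same
# text to appear again after a single space.
doubled = re.findall(r"\b(\w+) \1\b", text)

print(doubled)  # ['is', 'the']
```

Because the pattern has exactly one group, re.findall returns what that group captured rather than the whole match.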

18

Concluding

• REs are a very powerful tool, very often very useful

• The language notation is compact and a bit hard to read

• Practice, study the examples, don’t worry about memorization.

19

Advice on Scripting

• Scripting, and programming in general, is a process

• Successful scripts don’t spring into existence whole

– Scripts built in small increments

• Attend to:

– Decomposition

– Stories

– Testing

20

Advice on Scripting

• Decomposition

– Solve big problems by decomposing them into small problems and solving them

• Stories

– Scripting/programming as a form of literature

– Use comments with code to tell a clear story about what the code is or should be doing

• Testing

– Everything, whole and part, often, varying inputs
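As one possible illustration of this advice, the sketch below splits a small RE task into two small functions, uses docstrings and comments to say what each part should do, and tests both the parts and the whole with varied inputs. The function names and the task itself are invented for the example.

```python
import re

def find_tags(text):
    """Return every <...> tag in text, using a non-greedy match."""
    return re.findall(r"<.+?>", text)

def tag_name(tag):
    """Return a tag's name, e.g. 'div' for '<div align="center">' or '</div>'."""
    match = re.search(r"</?(\w+)", tag)
    return match.group(1) if match else None

def tag_names(text):
    """Whole task: the names of all tags in text, in order of appearance."""
    return [tag_name(tag) for tag in find_tags(text)]

# Test the parts first, then the whole, with varied inputs.
assert find_tags("<b>hi</b>") == ["<b>", "</b>"]
assert find_tags("no tags here") == []
assert tag_name("</div>") == "div"
assert tag_names('<i class="x">a</i><br>') == ["i", "i", "br"]
print("all tests passed")
```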