
Python: Stringology

Marcin Młotkowski

27th March, 2013

Regular expressions, Results grouping, html processing, XML processing

1 Regular expressions

2 Results grouping

3 html processing

4 XML processing

Marcin Młotkowski Python


Regular expressions in examples

MS Windows system

c:\WINDOWS\system32> dir *.exe

Result
accwiz.exe
actmovie.exe
ahui.exe
alg.exe
append.exe
arp.exe
asr_fmt.exe
asr_ldm.exe
...


Examples, cont.

?N*X, *BSD
$ rm *.tmp

Examples of regular expressions

reg. exp.      words
'alamakota'    { 'alamakota' }
'(hop!)*'      { '', 'hop!', 'hop!hop!', 'hop!hop!hop!', ... }
'br+um'        { 'brum', 'brrum', 'brrrum', ... }
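The word sets above can be checked mechanically. A minimal Python 3 sketch, assuming `re.fullmatch` as the membership test; the helper name `in_language` is illustrative and does not come from the slides:

```python
import re

def in_language(pattern, word):
    """Return True if the pattern describes the whole word."""
    return re.fullmatch(pattern, word) is not None

# 'alamakota' describes exactly one word
print(in_language('alamakota', 'alamakota'))

# '(hop!)*' describes zero or more copies of 'hop!'
hop = [in_language('(hop!)*', w) for w in ['', 'hop!', 'hop!hop!', 'hop']]
print(hop)

# 'br+um' needs at least one 'r'
brum = [in_language('br+um', w) for w in ['brum', 'brrrum', 'bum']]
print(brum)
```

`re.fullmatch` is used here because membership in the set means the pattern must cover the entire word, not just a prefix.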


Searching and matching

re library

import re

matching

if re.match('brr+um', 'brrrrum!!!'): print 'matches'

searching

if re.search('brr+um', 'Automobile sounds brrrrum!!!'): print 'exists'
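The difference between matching and searching shows up when both run on the same text. A short Python 3 sketch (the slides use Python 2 `print` statements); the variable names are illustrative:

```python
import re

text = 'Automobile sounds brrrrum!!!'

# match() only tries the pattern at the beginning of the text
at_start = re.match('brr+um', text)

# search() scans the text for the first place where the pattern fits
anywhere = re.search('brr+um', text)

print(at_start)           # None: the text does not start with 'br...'
print(anywhere.group())   # brrrrum
```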


Regular expression compilation

import re
automat = re.compile('brr+um')
automat.search('brrrrum')
automat.match('brrrrum')


Result interpretation

>>> re.search(’brr+um’, ’brrrum!!!’)

MatchObject

.group(): matched text

.start(): beginning of matched text

.end(): end of matched text
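A small Python 3 sketch of reading a match object; the sample text is borrowed from the searching slide, and the variable names are illustrative. Recent Python 3 versions call the class `re.Match` rather than `MatchObject`.

```python
import re

text = 'Automobile sounds brrrrum!!!'
m = re.search('brr+um', text)

if m is not None:               # search() returns None when nothing fits
    print(m.group())            # brrrrum
    print(m.start(), m.end())   # 18 25
    # start() and end() are slice bounds into the searched text
    print(text[m.start():m.end()] == m.group())
```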


Advanced example

Task
On an HTML page, find all references to other pages.

Examples
www.ii.uni.wroc.pl
www.gogole.com


Solution

Implementation

adres = '([a-zA-Z]+\.)*[a-zA-Z]+'        # adres = address: dot-separated runs of letters
automat = re.compile('http://' + adres)
tekst = fh.read()                        # tekst = text of the page

[ url.group() for url in automat.finditer(tekst) ]
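To try the solution without an open file, the page can be replaced by a string. A Python 3 sketch under that assumption: the content of `tekst` is made up for illustration, and the pattern is written as a raw string so the backslash reaches the regex engine unchanged.

```python
import re

# Same pattern as on the slide: dot-separated runs of letters
adres = r'([a-zA-Z]+\.)*[a-zA-Z]+'
automat = re.compile('http://' + adres)

# Hypothetical page content standing in for fh.read()
tekst = ('<a href="http://www.ii.uni.wroc.pl">Institute</a> '
         '<a href="http://www.gogole.com/index.html">Search</a>')

urls = [url.group() for url in automat.finditer(tekst)]
print(urls)
```

The pattern stops at the first character that is neither a letter nor a dot, so the path `/index.html` is cut off, and host names containing digits or hyphens would be truncated as well.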


Metasymbols overview

symbol     description
w*         zero or more repetitions of w
w+         at least one repetition of w
w1|w2      alternative: w1 or w2
w{m,n}     w occurs at least m times and at most n times
.          any character except newline
w?         0 or 1 occurrence of w
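Each metasymbol can be tried on a tiny input. A Python 3 sketch; the helper `full` and the sample words are illustrative choices, not part of the slides:

```python
import re

def full(pattern, word):
    """True if the pattern describes the whole word."""
    return re.fullmatch(pattern, word) is not None

checks = {
    'w*':     [full('a*', ''), full('a*', 'aaa')],
    'w+':     [full('a+', ''), full('a+', 'aaa')],
    'w1|w2':  [full('cat|dog', 'dog'), full('cat|dog', 'cow')],
    'w{m,n}': [full('a{2,3}', 'a'), full('a{2,3}', 'aa'), full('a{2,3}', 'aaaa')],
    '.':      [full('.', 'x'), full('.', '\n')],
    'w?':     [full('colou?r', 'color'), full('colou?r', 'colour')],
}

for symbol, results in checks.items():
    print(symbol, results)
```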


Popular abbreviations

symbol     description
\d         any digit
\w         alphanumeric character or underscore (depends on LOCALE)
\Z         end of text
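A brief Python 3 sketch of the three abbreviations; the sample strings are made up. In Python 3, `\w` on `str` patterns is Unicode-aware by default rather than locale-dependent.

```python
import re

digits = re.findall(r'\d', 'tel. 71 375')
words = re.findall(r'\w+', 'snake_case, CamelCase!')

# \Z anchors the pattern to the very end of the text
ends_with_number = re.search(r'\d+\Z', 'room 101') is not None
ends_with_letter = re.search(r'\d+\Z', 'room 101b') is not None

print(digits)
print(words)
print(ends_with_number, ends_with_letter)
```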


Problem with backslash

Role of backslash in Python

'Name\tSurname\n'
print 'Tabulator is a character \\t'
'c:\\WINDOWS\\win.ini'


Backslash in regular expressions

Searching for '['

re.match('\[', '[')

A puzzle

How to find '\['?


Approaches

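Pattern '\[' (the regex engine sees \[, a literal '[')
re.match('\[', '\[') # result: None, the text starts with a backslash
re.match('\[', '[') # result: a match

Pattern '\\[' (still \[ for the regex engine)
re.match('\\[', '\[') # result: None
re.match('\\[', '[') # result: a match

Patterns '\\\[' and '\\\\[' (the regex engine sees \\[)
re.match('\\\[', '\[') # error of regexp compilation: '[' opens a set that is never closed
re.match('\\\\[', '\[') # error of regexp compilation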


Ultimate solution

A solution
re.match('\\\\\[', '\[')
re.match(r'\\\[', '\[')
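The two layers of escaping can be separated by writing every pattern as a raw string, so that what is typed is exactly what the regex engine receives. A Python 3 sketch with illustrative variable names; where the slide's five-backslash spelling relies on Python leaving the unknown escape `\[` untouched, the sketch doubles every backslash instead:

```python
import re

text = '\\['    # two characters: a backslash, then '['
assert len(text) == 2

# Regex \[ stands for a literal '[' only, so it cannot match the backslash
only_bracket = re.match(r'\[', text)

# Regex \\[ is an escaped backslash followed by an unclosed character set
try:
    re.compile(r'\\[')
    compiles = True
except re.error:
    compiles = False

# Regex \\\[ is an escaped backslash followed by an escaped '['
found = re.match(r'\\\[', text)

# Without the r prefix every backslash has to be doubled
same_pattern = ('\\\\\\[' == r'\\\[')

print(only_bracket, compiles, found.group(), same_pattern)
```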


String processing

String processing by Python

string in Python    'true' character
'\n'                0x0A
'\t'                0x09
'\\'                0x5C

String processing by regular expressions

string in regex     'true' character
'\[' 0x5B
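The character codes can be read off with `ord`. A short Python 3 sketch; the variable names are illustrative:

```python
import re

# Escapes resolved by Python itself
python_level = [hex(ord('\n')), hex(ord('\t')), hex(ord('\\'))]

# Escape resolved by the regex engine: \[ matches the character '['
m = re.match(r'\[', '[')
regex_level = hex(ord(m.group()))

print(python_level, regex_level)
```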


A few words on groups

res = re.match('a(b*)a.*(a)', 'abbabbba')
print res.groups()

Result
('bb', 'a')
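Besides `groups()`, individual groups can be fetched by number. A Python 3 sketch of the same match; group 0 is the whole match and the numbering follows the opening parentheses:

```python
import re

res = re.match('a(b*)a.*(a)', 'abbabbba')

print(res.groups())                 # ('bb', 'a')
print(res.group(0))                 # abbabbba
print(res.group(1), res.group(2))   # bb a
print(res.span(1), res.span(2))     # (1, 3) (7, 8)
```

The greedy `.*` first swallows the rest of the text and then gives back one character so that the final `(a)` can still match.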


Grouping expression

(?P<name>regexp)


Task

From a date in the format '20061204', extract the day, month, and year.


A solution

Regular expression

wzor = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'    # wzor = pattern

res = re.search(wzor, 'On 20110406 there is a Python lecture')

print res.group('year'), res.group('month')
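A Python 3 version of the same solution that also prints the day and shows `groupdict()`, which returns all named groups at once:

```python
import re

wzor = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'
res = re.search(wzor, 'On 20110406 there is a Python lecture')

print(res.group('year'), res.group('month'), res.group('day'))
print(res.groupdict())

# Named groups keep their positional numbers as well
print(res.group(1) == res.group('year'))
```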


html processing

An HTML file is a sequence of tags:

<html>
<title>Title</title>
<body bgcolor="red">
<div align="center">Text</div>
</body>
</html>

Opening tags
<html>, <body>, <div>

Closing tags

</body>, </div>, </html>


sgmllib

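import sgmllib    # Python 2 only; the module was removed in Python 3

class sgmllib.SGMLParser:
    def start_tag(self, attrs):    # 'tag' stands for a tag name, e.g. start_a
    def end_tag(self):             # e.g. end_a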


How to use sgmllib

Task
Find all 'href' references:
<a href="adres">Text</a>

class MyParser(sgmllib.SGMLParser):

    def start_a(self, attrs):
        # called for every <a ...> tag; attrs is a list of (name, value) pairs
        for (atr, val) in attrs:
            if atr == 'href': print val

p = MyParser()
p.feed(dokument)
p.close()
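Since `sgmllib` is not available in Python 3, the same task can be sketched with the standard `html.parser` module instead. This swaps in a different library from the one on the slide; the class name `LinkParser` and the sample `dokument` are illustrative.

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # One handler receives all opening tags, so filter on the tag name
        if tag == 'a':
            for atr, val in attrs:
                if atr == 'href':
                    self.links.append(val)

dokument = '<p><a href="adres">Text</a> <a name="top">no link</a></p>'

p = LinkParser()
p.feed(dokument)
p.close()
print(p.links)
```

Unlike `sgmllib`, which dispatches to a method per tag name (`start_a`), `HTMLParser` calls a single `handle_starttag` method for every opening tag.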


XML

Example
<?xml version="1.0" encoding="UTF-8"?>
<biblioteka>
  <ksiazka egzemplarze="3">
    <autor>Ascher, Martelli, Ravenscroft</autor>
    <tytul>Python cookbook</tytul>
  </ksiazka>
  <ksiazka>
    <autor/>
    <tytul>Python for beginners</tytul>
  </ksiazka>
</biblioteka>


XML processing

processing of subsequent elements (SAX, saxutils)
creating a tree (DOM) corresponding to the XML document


SAX: Simple API for XML

elements of the document are read step by step
for each element a proper method is called


Parser implementation

Default parser

from xml.sax import *

class saxutils.DefaultHandler:
    def startDocument(self): pass
    def endDocument(self): pass
    def startElement(self, name, attrs): pass
    def endElement(self, name): pass
    def characters(self, value): pass


Own parser implementation

class SaxReader(saxutils.DefaultHandler):

    def characters(self, value):
        print value

    def startElement(self, name, attrs):
        for x in attrs.keys():
            ...


How to use parser

from xml.sax import make_parser
from xml.sax.handler import feature_namespaces
from xml.sax import saxutils

parser = make_parser()
parser.setFeature(feature_namespaces, 0)
dh = SaxReader()
parser.setContentHandler(dh)
parser.parse(fh)
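A self-contained Python 3 sketch of the same idea. It assumes `xml.sax.handler.ContentHandler` as the base class and `xml.sax.parseString` instead of an open file; the `events` list and the shortened document are illustrative. The handler records what it sees rather than printing it, which makes the call order visible.

```python
import xml.sax
from xml.sax.handler import ContentHandler

XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<biblioteka>
  <ksiazka egzemplarze="3">
    <autor>Ascher, Martelli, Ravenscroft</autor>
    <tytul>Python cookbook</tytul>
  </ksiazka>
</biblioteka>"""

class SaxReader(ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # attrs behaves like a read-only mapping of attribute names to values
        self.events.append(('start', name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(('end', name))

    def characters(self, content):
        # Text may arrive in several chunks; whitespace-only chunks are skipped
        if content.strip():
            self.events.append(('text', content))

dh = SaxReader()
xml.sax.parseString(XML, dh)

started = [e[1] for e in dh.events if e[0] == 'start']
print(started)
```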


SAX: summary

Read-only mode processing;
processes parts of the document;
SAX is fast, with small memory requirements.


DOM: Document Object Model

A document is kept entirely as a tree;
A document (its tree) can be modified;
Processing needs time and memory, because the whole tree is kept in memory;
The DOM specification is maintained by the W3C.


Reminder

Example
<?xml version="1.0" encoding="UTF-8"?>
<biblioteka>
  <ksiazka egzemplarze="3">
    <autor>Ascher, Martelli, Ravenscroft</autor>
    <tytul>Python. Receptury</tytul>
  </ksiazka>
  <ksiazka>
    <autor/>
    <tytul>Python. Od podstaw</tytul>
  </ksiazka>
</biblioteka>


A picture

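[Figure: DOM tree of the example document. The Document node (<?xml version="1.0" encoding="UTF-8"?>) holds the root Element <biblioteka>. Its children are the two <ksiazka> Elements interleaved with whitespace-only Text nodes (""). Below <ksiazka> are the Elements <autor> and <tytul>, each with a Text child ("Ascher, ...", "Python. Od ...").]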


Python libraries

xml.dom: DOM Level 2
xml.dom.minidom: Lightweight DOM implementation, DOM Level 1


minidom implementation

A class Node

class attribute    example
.nodeName          library, book, author
.nodeValue         "Python cookbook"
.attributes        <book copies="3">
.childNodes        list of subnodes


Tree creation

XML file processing

import xml.dom.minidom

def wezel(node):
    # wezel = node; prints the node names depth-first
    print node.nodeName
    for n in node.childNodes:
        wezel(n)

doc = xml.dom.minidom.parse('content.xml')
wezel(doc)
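The traversal can be tried without a file by parsing a string. A Python 3 sketch, assuming `xml.dom.minidom.parseString` and a shortened version of the library example written without whitespace between tags; the function name `node_names` is illustrative.

```python
import xml.dom.minidom

XML = ('<biblioteka><ksiazka egzemplarze="3">'
       '<autor>Ascher, Martelli, Ravenscroft</autor>'
       '<tytul>Python cookbook</tytul></ksiazka></biblioteka>')

def node_names(node, depth=0):
    """Return (depth, nodeName) pairs in document order."""
    result = [(depth, node.nodeName)]
    for child in node.childNodes:
        result.extend(node_names(child, depth + 1))
    return result

doc = xml.dom.minidom.parseString(XML)
names = node_names(doc)

for depth, name in names:
    print('  ' * depth + name)
```

Text nodes report the name `#text` and the document itself `#document`. With the indented form of the file, additional whitespace-only `#text` nodes appear between the elements.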


DOM processing

Node manipulation

appendChild(newChild)
removeChild(oldChild)
replaceChild(newChild, oldChild)

New node creation
new = document.createElement('chapter')
new.setAttribute('number', '5')
document.documentElement.appendChild(new)

print document.toxml()
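A Python 3 sketch that exercises all three manipulation methods on a tiny document. The `<book>` document and the `<preface>` element are made up for illustration.

```python
import xml.dom.minidom

document = xml.dom.minidom.parseString('<book><chapter number="1"/></book>')
root = document.documentElement

# appendChild: add a new <chapter number="5"/> at the end
new = document.createElement('chapter')
new.setAttribute('number', '5')
root.appendChild(new)
after_append = [c.getAttribute('number') for c in root.childNodes]

# replaceChild(newChild, oldChild): swap the first chapter for a preface
replacement = document.createElement('preface')
root.replaceChild(replacement, root.firstChild)
after_replace = [c.tagName for c in root.childNodes]

# removeChild: drop the chapter added above
root.removeChild(new)
after_remove = [c.tagName for c in root.childNodes]

print(after_append, after_replace, after_remove)
print(root.toxml())
```

Note the argument order of `replaceChild`: the new node comes first, the node being replaced second.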


Summary: DOM

processes the entire tree
needs a lot of time and memory for large files
