Python: Stringology
Marcin Młotkowski
27th March, 2013
Regular expressions | Results grouping | html processing | XML processing
1 Regular expressions
2 Results grouping
3 html processing
4 XML processing
Marcin Młotkowski Python
Regular expressions in examples
MS Windows system
c:\WINDOWS\system32> dir *.exe
Result: accwiz.exe, actmovie.exe, ahui.exe, alg.exe, append.exe, arp.exe, asr_fmt.exe, asr_ldm.exe, ...
Examples, cont.
?N*X, *BSD
$ rm *.tmp
Examples of regular expressions
reg. exp.      words
'alamakota'    { 'alamakota' }
'(hop!)*'      { '', 'hop!', 'hop!hop!', 'hop!hop!hop!', ... }
'br+um'        { 'brum', 'brrum', 'brrrum', ... }
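The three rows of the table can be checked directly in Python. This is a minimal sketch, assuming that "a word described by the expression" means the whole word must match, which is what `re.fullmatch` tests; the helper name `belongs` is hypothetical.

```python
import re

def belongs(pattern, word):
    """Return True when the whole word is described by the pattern."""
    return re.fullmatch(pattern, word) is not None

# 'alamakota' describes exactly one word.
print(belongs('alamakota', 'alamakota'))
print(belongs('alamakota', 'alamakot'))

# '(hop!)*' describes zero or more copies of 'hop!', including the empty word.
print(belongs('(hop!)*', ''))
print(belongs('(hop!)*', 'hop!hop!'))

# 'br+um' needs at least one 'r'.
print(belongs('br+um', 'brrrum'))
print(belongs('br+um', 'bum'))
```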
Searching and matching
re library
import re
matching
if re.match('brr+um', 'brrrrum!!!'): print('matches')
searching
if re.search('brr+um', 'Automobile sounds brrrrum!!!'): print('exists')
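The difference between the two calls shows up when the pattern does not occur at the very beginning of the text. A short sketch using the slide's own pattern and sentence; the variable names are illustrative only.

```python
import re

text = 'Automobile sounds brrrrum!!!'

# match() succeeds only if the pattern fits at the very beginning of the text.
at_start = re.match('brr+um', text)

# search() scans the text and reports the first place where the pattern fits.
anywhere = re.search('brr+um', text)

print(at_start)           # None, because the text starts with 'Automobile'
print(anywhere.group())   # brrrrum
```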
Regular expression compilation
import re
automat = re.compile('brr+um')
automat.search('brrrrum')
automat.match('brrrrum')
Result interpretation
>>> re.search('brr+um', 'brrrum!!!')
MatchObject
.group(): matched text
.start(): index at which the matched text begins
.end(): index just past the end of the matched text
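A small sketch of the three methods on a match object. The sample sentence is borrowed from the searching slide so that the match does not start at index 0.

```python
import re

text = 'Automobile sounds brrrrum!!!'
m = re.search('brr+um', text)

print(m.group())   # brrrrum
print(m.start())   # 18
print(m.end())     # 25

# start() and end() are ordinary slice indices into the searched text.
print(text[m.start():m.end()] == m.group())
```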
Advanced example
Task: on an HTML page, find all references to other pages.
Examples: www.ii.uni.wroc.pl, www.gogole.com
Solution
Implementation
import re
adres = r'([a-zA-Z]+\.)*[a-zA-Z]+'   # host name: labels separated by dots
automat = re.compile('http://' + adres)
tekst = fh.read()   # fh: an already opened file with the page
[url.group() for url in automat.finditer(tekst)]
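To try the solution without a file, the page can be replaced by a string. In this sketch the content of `tekst` is a made-up stand-in for `fh.read()`; the pattern itself is the one from the slide.

```python
import re

adres = r'([a-zA-Z]+\.)*[a-zA-Z]+'
automat = re.compile('http://' + adres)

# Hypothetical page content standing in for fh.read().
tekst = (
    '<a href="http://www.ii.uni.wroc.pl/index.html">Institute</a> '
    '<a href="http://www.gogole.com">Search</a>'
)

urls = [url.group() for url in automat.finditer(tekst)]
print(urls)   # ['http://www.ii.uni.wroc.pl', 'http://www.gogole.com']
```

The pattern covers only letters and dots, so each match stops at the end of the host name; the path `/index.html` is not part of the result.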
Metasymbols overview
symbol     description
w*         zero or more repetitions of w
w+         at least one repetition of w
w1|w2      alternative: w1 or w2
w{m,n}     w occurs at least m times and at most n times
.          any character except newline
w?         0 or 1 occurrence of w
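Each metasymbol from the table can be exercised with a one-line check. This is a sketch with example patterns and words chosen for illustration; `full` is a hypothetical helper around `re.fullmatch`.

```python
import re

def full(pattern, word):
    """True when the pattern describes the whole word."""
    return re.fullmatch(pattern, word) is not None

checks = [
    ('ab*',     'a',     True),    # * allows zero repetitions of b
    ('ab+',     'a',     False),   # + needs at least one b
    ('cat|dog', 'dog',   True),    # | chooses between alternatives
    ('a{2,3}',  'aa',    True),    # between 2 and 3 repetitions
    ('a{2,3}',  'aaaa',  False),   # 4 repetitions is too many
    ('a.c',     'abc',   True),    # . stands for any character...
    ('a.c',     'a\nc',  False),   # ...except a newline
    ('colou?r', 'color', True),    # ? makes the u optional
]

for pattern, word, expected in checks:
    print(repr(pattern), repr(word), full(pattern, word) == expected)
```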
Popular abbreviations
symbol     description
\d         any digit
\w         alphanumeric character or underscore (depends on LOCALE)
\Z         end of text
Problem with backslash
Role of backslash in Python
'Name\tSurname\n'
print('Tabulator is a character \\t')
'c:\\WINDOWS\\win.ini'
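The effect of the backslash can be seen by counting characters. A brief sketch; the raw-string form `r'...'` is standard Python syntax, and the variable names are illustrative.

```python
tab = '\t'          # one character: the tabulator
literal = '\\t'     # two characters: a backslash followed by 't'

path = 'c:\\WINDOWS\\win.ini'
raw_path = r'c:\WINDOWS\win.ini'   # raw string: backslashes are kept as written

print(len(tab))           # 1
print(len(literal))       # 2
print(path == raw_path)   # True
print(path)               # c:\WINDOWS\win.ini
```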
Backslash in regular expressions
Searching for '['
re.match('\[', '[')
A puzzle
How to find '\['?
Approaches
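Pattern '\['
re.match('\[', '\[')      # result: None
re.match('\[', '[')       # matches, but finds '[' and not '\['
Pattern '\\[' (the same two characters as '\[')
re.match('\\[', '\[')     # result: None
re.match('\\[', '[')      # matches, but finds '[' and not '\['
Patterns with three and four backslashes
re.match('\\\[', '\[')    # error of regexp compilation
re.match('\\\\[', '\[')   # error of regexp compilation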
Ultimate solution
A solution
re.match('\\\\\[', '\[')
re.match(r'\\\[', '\[')
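The puzzle can be replayed step by step. In this sketch every pattern is written with valid Python escapes only (each backslash doubled), so the comments state which characters the regex engine actually receives; the variable names are illustrative.

```python
import re

subject = '\\['       # the two-character text \[

two = '\\['           # regex \[   : a literal '['
three = '\\\\['       # regex \\[  : a literal backslash, then an unclosed '['
four = '\\\\\\['      # regex \\\[ : a literal backslash, then a literal '['

# The subject starts with a backslash, so a pattern for '[' alone fails.
print(re.match(two, subject))

# An unescaped '[' opens a character set that is never closed.
try:
    re.compile(three)
except re.error as exc:
    print('compilation error:', exc)

# Both the backslash and the bracket have to be escaped for the regex engine.
print(re.match(four, subject).group())
print(re.match(r'\\\[', subject).group())   # the same pattern as a raw string
```

With a raw string, the pattern is typed exactly as the regex engine should see it, which removes one of the two levels of escaping.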
String processing
String processing by Python
string in Python    'true' character
'\n'                0x0A
'\t'                0x09
'\\'                0x5C
String processing by regular expressions
string in regex     'true' character
'\['                0x5B
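The two processing stages can be observed separately: first Python turns escape sequences into characters, then the regex engine interprets what is left. A minimal sketch using `ord` to show the character codes.

```python
import re

# Stage 1: Python converts escape sequences in string literals to characters.
print(hex(ord('\n')))   # 0xa
print(hex(ord('\t')))   # 0x9
print(hex(ord('\\')))   # 0x5c

# Stage 2: the regex engine reads the resulting characters; \[ means '['.
pattern = r'\['
m = re.match(pattern, '[')
print(hex(ord(m.group())))   # 0x5b
```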
Few words on groups
res = re.match('a(b*)a.*(a)', 'abbabbba')
print(res.groups())
Result: ('bb', 'a')
Grouping expression
(?P<name>regexp)
Task
From a date in the format '20061204', extract the day, month, and year.
A solution
Regular expression
wzor = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'
res = re.search(wzor, 'On 20110406 there is a Python lecture')
print(res.group('year'), res.group('month'))
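The same match object also exposes the groups by position and as a dictionary. A short sketch built on the slide's pattern; `groups()` and `groupdict()` are standard methods of match objects.

```python
import re

wzor = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'
res = re.search(wzor, 'On 20110406 there is a Python lecture')

print(res.group('year'), res.group('month'), res.group('day'))   # 2011 04 06
print(res.groups())      # ('2011', '04', '06')
print(res.groupdict())   # {'year': '2011', 'month': '04', 'day': '06'}

# Named groups can still be reached by their position.
print(res.group(1) == res.group('year'))
```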
html processing
An HTML file is a sequence of tags:
<html>
<title>Title</title>
<body bgcolor="red">
<div align="center">Text</div>
</body>
</html>
Opening tags: <html>, <body>, <div>
Closing tags: </body>, </div>, </html>
sgmllib
import sgmllib
class sgmllib.SGMLParser:
    # 'tag' stands for the name of a tag, e.g. start_a, end_a
    def start_tag(self, attrs):
    def end_tag(self):
How to use sgmllib
Task: find all 'href' references, as in <a href="adres">Text</a>
class MyParser(sgmllib.SGMLParser):
    def start_a(self, attrs):
        for (atr, val) in attrs:
            if atr == 'href': print(val)
p = MyParser()
p.feed(dokument)
p.close()
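The `sgmllib` module exists only in Python 2. The sketch below performs the same task with `html.parser.HTMLParser` from the Python 3 standard library, which is a swapped-in replacement and not the module used on the slide; the class name `LinkParser` and the sample `dokument` are assumptions made for illustration.

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the value of every href attribute found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, as in sgmllib.
        if tag == 'a':
            for atr, val in attrs:
                if atr == 'href':
                    self.links.append(val)

# Hypothetical document standing in for a downloaded page.
dokument = '<html><body><a href="adres">Text</a></body></html>'

p = LinkParser()
p.feed(dokument)
p.close()
print(p.links)   # ['adres']
```

Instead of one method per tag name (`start_a`), `HTMLParser` calls a single `handle_starttag` method and passes the tag name as an argument.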
XML
Example
<?xml version="1.0" encoding="UTF-8"?>
<biblioteka>
  <ksiazka egzemplarze="3">
    <autor>Ascher, Martelli, Ravenscroft</autor>
    <tytul>Python cookbook</tytul>
  </ksiazka>
  <ksiazka>
    <autor/>
    <tytul>Python for beginners</tytul>
  </ksiazka>
</biblioteka>
XML processing
process subsequent elements one by one (SAX)
create a tree (DOM) corresponding to the XML document
SAX: Simple API for XML
elements of the document are read step by step
for each element an appropriate method is called
Parser implementation
Default parser
from xml.sax import handler
class handler.ContentHandler:
    def startDocument(self): pass
    def endDocument(self): pass
    def startElement(self, name, attrs): pass
    def endElement(self, name): pass
    def characters(self, value): pass
Own parser implementation
class SaxReader(handler.ContentHandler):
    def characters(self, value):
        print(value)
    def startElement(self, name, attrs):
        for x in attrs.keys():
            ...
How to use parser
from xml.sax import make_parser
from xml.sax.handler import feature_namespaces
from xml.sax import saxutils
parser = make_parser()
parser.setFeature(feature_namespaces, 0)
dh = SaxReader()
parser.setContentHandler(dh)
parser.parse(fh)
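A complete, self-contained variant of the steps above may look as follows. It is a sketch under a few assumptions: the handler derives from `xml.sax.handler.ContentHandler` (the base class available in the standard library), the document is passed as a byte string through `xml.sax.parseString` instead of a file handle, and the class name `TitleReader` is made up for this example.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class TitleReader(ContentHandler):
    """Collect the text of every <tytul> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == 'tytul':
            self.in_title = True
            self.titles.append('')

    def characters(self, content):
        # characters() may be called several times for one text node.
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == 'tytul':
            self.in_title = False

dokument = b"""<?xml version="1.0" encoding="UTF-8"?>
<biblioteka>
  <ksiazka egzemplarze="3">
    <autor>Ascher, Martelli, Ravenscroft</autor>
    <tytul>Python cookbook</tytul>
  </ksiazka>
  <ksiazka>
    <autor/>
    <tytul>Python for beginners</tytul>
  </ksiazka>
</biblioteka>"""

dh = TitleReader()
xml.sax.parseString(dokument, dh)
print(dh.titles)   # ['Python cookbook', 'Python for beginners']
```

The handler keeps only a flag and a list, which illustrates why SAX needs little memory: the document itself is never stored.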
SAX: summary
Read-only mode processing;
processes parts of the document;
SAX is fast, with small memory requirements.
DOM: Document Object Model
A document is kept entirely as a tree;
A document (its tree) can be modified;
Processing needs time and memory, because the whole tree is kept in memory;
The DOM specification is maintained by the W3C.
Reminder
Example
<?xml version="1.0" encoding="UTF-8"?>
<biblioteka>
  <ksiazka egzemplarze="3">
    <autor>Ascher, Martelli, Ravenscroft</autor>
    <tytul>Python. Receptury</tytul>
  </ksiazka>
  <ksiazka>
    <autor/>
    <tytul>Python. Od podstaw</tytul>
  </ksiazka>
</biblioteka>
A picture
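[Figure: DOM tree of the example document. The Document node (<?xml version="1.0" encoding="UTF-8"?>) contains the Element <biblioteka>, whose children are the two <ksiazka> Elements and whitespace Text nodes (""); a <ksiazka> contains the Elements <autor> and <tytul>, each with a Text child ("Ascher, ...", "Python. Od ...").]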
Python libraries
xml.dom: DOM Level 2
xml.dom.minidom: lightweight DOM implementation, DOM Level 1
minidom implementation
A class Node
class attribute    example
.nodeName          library, book, author
.nodeValue         "Python cookbook"
.attributes        <book copies="3">
.childNodes        list of subnodes
Tree creation
XML file processing
import xml.dom.minidom
def wezel(node):
    # print the name of the node, then visit its children recursively
    print(node.nodeName)
    for n in node.childNodes:
        wezel(n)
doc = xml.dom.minidom.parse('content.xml')
wezel(doc)
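The traversal can be tried without a file by parsing a string. In this sketch the document is a shortened, whitespace-free version of the library example, and `node_names` is a hypothetical variant of `wezel` that returns the names instead of printing them.

```python
import xml.dom.minidom

def node_names(node):
    """Return the names of a node and all its descendants, in document order."""
    names = [node.nodeName]
    for n in node.childNodes:
        names.extend(node_names(n))
    return names

doc = xml.dom.minidom.parseString(
    '<biblioteka><ksiazka egzemplarze="3">'
    '<autor>Ascher, Martelli, Ravenscroft</autor>'
    '<tytul>Python cookbook</tytul>'
    '</ksiazka></biblioteka>'
)

print(node_names(doc))
# ['#document', 'biblioteka', 'ksiazka', 'autor', '#text', 'tytul', '#text']

# Attributes and text values are reached through the Node attributes.
ksiazka = doc.getElementsByTagName('ksiazka')[0]
tytul = doc.getElementsByTagName('tytul')[0]
print(ksiazka.getAttribute('egzemplarze'))   # 3
print(tytul.firstChild.nodeValue)            # Python cookbook
```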
DOM processing
Node manipulation
appendChild(newChild)
removeChild(oldChild)
replaceChild(newChild, oldChild)
New node creation
new = document.createElement('chapter')
new.setAttribute('number', '5')
document.documentElement.appendChild(new)
print(document.toxml())
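A runnable sketch of node creation and removal. The starting document `<book>` with one chapter is an assumption made for this example; the calls themselves (`createElement`, `setAttribute`, `appendChild`, `removeChild`, `toxml`) are the ones listed above.

```python
import xml.dom.minidom

# Hypothetical starting document with a single chapter.
document = xml.dom.minidom.parseString('<book><chapter number="4"/></book>')

# Create a new element and attach it as the last child of the root.
new = document.createElement('chapter')
new.setAttribute('number', '5')
document.documentElement.appendChild(new)
print(document.documentElement.toxml())
# <book><chapter number="4"/><chapter number="5"/></book>

# removeChild() detaches a node; here the first chapter is dropped.
old = document.documentElement.firstChild
document.documentElement.removeChild(old)
print(document.documentElement.toxml())
# <book><chapter number="5"/></book>
```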
Summary: DOM
processes the entire tree
needs a lot of time and memory for large files