Pythonmarcinm/dyd/python_eng/regex.pdf · 2013-03-27 · Regular expressions Results grouping html...

Python

Stringology

Marcin Młotkowski

27th March, 2013


1 Regular expressions

2 Results grouping

3 html processing

4 XML processing


Regular expressions in examples

MS Windows system

c:\WINDOWS\system32> dir *.exe

Result

accwiz.exe
actmovie.exe
ahui.exe
alg.exe
append.exe
arp.exe
asr_fmt.exe
asr_ldm.exe
...

Examples, cont.

?N*X, *BSD

$ rm *.tmp

Examples of regular expressions

reg. exp.      words
'alamakota'    { 'alamakota' }
'(hop!)*'      { '', 'hop!', 'hop!hop!', 'hop!hop!hop!', ... }
'br+um'        { 'brum', 'brrum', 'brrrum', ... }
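One way to check which words belong to the language of each expression in the table is re.fullmatch, which succeeds only when the whole string matches. This is a minimal sketch for Python 3; the helper name in_language is made up for this illustration and is not part of the lecture.

```python
import re

def in_language(pattern, word):
    """Return True if the whole word matches the pattern."""
    return re.fullmatch(pattern, word) is not None

# '(hop!)*' accepts zero or more copies of 'hop!'
hop_words = [w for w in ['', 'hop!', 'hop!hop!', 'hop'] if in_language('(hop!)*', w)]

# 'br+um' needs at least one 'r'
brum_words = [w for w in ['bum', 'brum', 'brrrum'] if in_language('br+um', w)]

print(hop_words)   # ['', 'hop!', 'hop!hop!']
print(brum_words)  # ['brum', 'brrrum']
```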

Searching and matching

re library

import re

matching

if re.match('brr+um', 'brrrrum!!!'): print('matches')

searching

if re.search('brr+um', 'Automobile sounds brrrrum!!!'): print('exists')
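The difference between the two calls shows up when the pattern does not occur at the very beginning of the text. A short sketch for Python 3, using the sentence from the slide; the variable names at_start and anywhere are illustrative only.

```python
import re

text = 'Automobile sounds brrrrum!!!'

# match() only tries at position 0, so it fails on this text
at_start = re.match('brr+um', text)

# search() scans the whole string for the first occurrence
anywhere = re.search('brr+um', text)

print(at_start)          # None
print(anywhere.group())  # brrrrum
```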

Regular expression compilation

import re
automat = re.compile('brr+um')
automat.search('brrrrum')
automat.match('brrrrum')

Result interpretation

>>> re.search('brr+um', 'brrrum!!!')

MatchObject

.group(): matched text

.start(): beginning of matched text

.end(): end of matched text
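The three methods can be tried on a compiled pattern as follows. This is a small Python 3 sketch; the sample sentence is chosen only for illustration.

```python
import re

automat = re.compile('brr+um')
text = 'Automobile sounds brrrum!!!'
m = automat.search(text)

print(m.group())  # brrrum
print(m.start())  # 18
print(m.end())    # 24

# start() and end() delimit exactly the matched slice of the text
print(text[m.start():m.end()] == m.group())  # True
```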

Advanced example

Task

On an HTML page, find all references to other pages.

Examples

www.ii.uni.wroc.pl
www.gogole.com

Solution

Implementation

adres = r'([a-zA-Z]+\.)*[a-zA-Z]+'
automat = re.compile('http://' + adres)
tekst = fh.read()    # fh: an already opened file with the page content

[url.group() for url in automat.finditer(tekst)]
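A self-contained variant of this solution might look as follows. The page content is a hypothetical string standing in for fh.read(), and the code is written for Python 3.

```python
import re

adres = r'([a-zA-Z]+\.)*[a-zA-Z]+'
automat = re.compile('http://' + adres)

# Hypothetical page content, standing in for fh.read()
tekst = ('<a href="http://www.ii.uni.wroc.pl">II</a> '
         '<a href="http://www.gogole.com/x">G</a>')

urls = [url.group() for url in automat.finditer(tekst)]
print(urls)  # ['http://www.ii.uni.wroc.pl', 'http://www.gogole.com']
```

Note that the pattern covers only the host name, so the path '/x' is not part of the second result. finditer with group() is used because findall would return only the text captured by the parenthesised group.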


Metasymbols overview

symbol     description
w*         zero or more repetitions of w
w+         at least one repetition of w
w1|w2      alternative of w1 and w2
w{m,n}     w occurs at least m times and at most n times
.          any character except newline
w?         0 or 1 occurrence of w
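Each row of the table can be checked with re.fullmatch. The patterns and test strings below are illustrative choices, not taken from the lecture; the sketch targets Python 3.

```python
import re

# (pattern, text, expected result of a whole-string match)
cases = [
    ('a*',      '',      True),   # zero repetitions are allowed
    ('a+',      '',      False),  # at least one repetition is required
    ('cat|dog', 'dog',   True),   # alternative
    ('a{2,3}',  'a',     False),  # fewer than m = 2 repetitions
    ('a{2,3}',  'aaa',   True),   # within the range 2..3
    ('a{2,3}',  'aaaa',  False),  # more than n = 3 repetitions
    ('a.c',     'abc',   True),   # . matches any character except newline
    ('a.c',     'a\nc',  False),
    ('colou?r', 'color', True),   # 0 or 1 occurrence of 'u'
]

results = [re.fullmatch(p, s) is not None for p, s, _ in cases]
for (p, s, _), got in zip(cases, results):
    print(repr(p), repr(s), got)
```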

Popular abbreviations

symbol     description
\d         any digit
\w         alphanumeric character or underscore (depends on LOCALE)
\Z         end of text
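A brief Python 3 sketch of these abbreviations in use; the sample strings are made up for the illustration.

```python
import re

digits = re.findall(r'\d', 'room 42b')
word_chars = re.findall(r'\w+', 'alpha_1 beta')
at_end = re.search(r'um\Z', 'brum') is not None
not_at_end = re.search(r'um\Z', 'brum!') is not None

print(digits)      # ['4', '2']
print(word_chars)  # ['alpha_1', 'beta']
print(at_end)      # True
print(not_at_end)  # False
```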

Problem with backslash

Role of backslash in Python

'Name\tSurname\n'
print('Tabulator is a character \\t')
'c:\\WINDOWS\\win.ini'

Backslash in regular expressions

Searching for '['

re.match('\[', '[')

A puzzle

How to find '\['?


Approaches

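Pattern '\['
re.match('\[', '\[')    # result: None (the pattern means a literal '[', but the text starts with a backslash)
re.match('\[', '[')     # matches '['

Pattern '\\['
re.match('\\[', '\[')   # result: None (Python turns '\\[' into \[, the same pattern as above)
re.match('\\[', '[')    # matches '['

Patterns '\\\[' and '\\\\['
re.match('\\\[', '\[')    # error of regexp compilation
re.match('\\\\[', '\[')   # error of regexp compilation
Both literals reach the re module as \\[ : an escaped backslash followed by an unclosed '['.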


Ultimate solution

A solution

re.match('\\\\\[', '\[')
re.match(r'\\\[', '\[')
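The solution can be verified with a short Python 3 sketch. It spells every backslash explicitly (six in an ordinary literal) because recent Python 3 versions warn about unrecognised escapes such as '\[' in ordinary string literals; the variable names are illustrative.

```python
import re

text = '\\['   # two characters: a backslash followed by '['

pattern_plain = '\\\\\\['   # ordinary literal: six backslashes, then '['
pattern_raw = r'\\\['       # raw literal: three backslashes, then '['

# Both literals denote the same four-character regex: \\\[
print(pattern_plain == pattern_raw)   # True

m = re.match(pattern_raw, text)
print(m.group() == text)              # True

# re.escape builds the same pattern from the text to be found
print(re.escape(text) == pattern_raw) # True
```

The raw-string form is usually easier to read, since what is written is exactly what the regular expression engine receives.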

String processing

String processing by Python

string in Python    'true' character
'\n'                0x0A
'\t'                0x09
'\\'                0x5C

String processing by regular expressions

string in regex     'true' character
'\['                0x5B
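The two levels of processing can be inspected directly with ord and len. A minimal Python 3 sketch; the dictionary keys and variable names are chosen only for this illustration.

```python
import re

# Level 1: Python turns escape sequences into single characters
codes = {
    'newline':   ord('\n'),   # 0x0A
    'tab':       ord('\t'),   # 0x09
    'backslash': ord('\\'),   # 0x5C
}
path = 'c:\\WINDOWS\\win.ini'   # each doubled backslash is one character

# Level 2: the regex engine reads \[ as the single character '['
bracket = re.match(r'\[', '[').group()

print({name: hex(code) for name, code in codes.items()})
print(len(path))          # 18
print(hex(ord(bracket)))  # 0x5b
```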

Few words on groups

res = re.match('a(b*)a.*(a)', 'abbabbba')
print(res.groups())

Result

('bb', 'a')
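Individual groups can also be read by number, where group 0 is the whole match. A short Python 3 sketch using the same pattern:

```python
import re

res = re.match('a(b*)a.*(a)', 'abbabbba')

print(res.group(0))  # abbabbba
print(res.group(1))  # bb
print(res.group(2))  # a

# .* is greedy, so the second group is the last 'a' in the text
print(res.span(2))   # (7, 8)
```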

Grouping expression

(?P<name>regexp)

From a date in the format '20061204', extract the day, month, and year.

A solution

Regular expression

wzor = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'

res = re.search(wzor, 'On 20110406 there is a Python lecture')

print(res.group('year'), res.group('month'))
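Named groups remain accessible by number, and groupdict() returns all of them at once. A Python 3 sketch built on the pattern from the slide; the variable name parts is illustrative.

```python
import re

wzor = re.compile(r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})')
res = wzor.search('On 20110406 there is a Python lecture')

# All named groups as a dictionary
parts = res.groupdict()
print(parts['year'], parts['month'], parts['day'])  # 2011 04 06

# A named group can still be addressed by its position
print(res.group('day') == res.group(3))  # True
```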


html processing

An HTML file is a sequence of tags:

<html>
<title>Title</title>
<body bgcolor="red">
<div align="center">Text</div>
</body>
</html>

Opening tags

<html>, <body>, <div>

Closing tags

</body>, </div>, </html>

sgmllib

import sgmllib

class sgmllib.SGMLParser:
    def start_tag(self, attrs):
    def end_tag(self):

How to use sgmllib

Task

Find all 'href' references:

<a href="address">Text</a>

class MyParser(sgmllib.SGMLParser):
    # start_a is called for every opening <a> tag
    def start_a(self, attrs):
        for (atr, val) in attrs:
            if atr == 'href': print(val)

p = MyParser()
p.feed(dokument)    # dokument: a string with the HTML text
p.close()
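The sgmllib module belongs to Python 2 and was removed in Python 3. A comparable sketch for Python 3 can be built on html.parser.HTMLParser from the standard library; this is a substitute technique, not the code from the lecture, and the class name LinkParser and the sample document are assumptions made for the illustration.

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, as in sgmllib
        if tag == 'a':
            for atr, val in attrs:
                if atr == 'href':
                    self.links.append(val)

# Hypothetical document used only for this example
dokument = ('<html><body><a href="address">Text</a> '
            '<a name="x">no link</a></body></html>')

p = LinkParser()
p.feed(dokument)
p.close()
print(p.links)  # ['address']
```

Unlike sgmllib, which dispatches to a separate start_a method, HTMLParser calls one handle_starttag method for all tags, so the tag name is tested explicitly.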


Example

<?xml version="1.0" encoding="UTF-8"?>
<biblioteka>
  <ksiazka egzemplarze="3">
    <autor>Ascher, Martelli, Ravenscroft</autor>
    <tytul>Python cookbook</tytul>
  </ksiazka>
  <ksiazka>
    <autor/>
    <tytul>Python for beginners</tytul>
  </ksiazka>
</biblioteka>

XML processing

processing of subsequent elements (saxutils)
creating a tree (DOM) corresponding to the XML document

SAX: Simple API for XML

elements of the document are read step by step
for each element a proper method is called

Parser implementation

Default parser

from xml.sax import *

class saxutils.DefaultHandler:
    def startDocument(self): pass
    def endDocument(self): pass
    def startElement(self, name, attrs): pass
    def endElement(self, name): pass
    def characters(self, value): pass

Own parser implementation

class SaxReader(saxutils.DefaultHandler):

    def characters(self, value):
        print(value)

    def startElement(self, name, attrs):
        for x in attrs.keys():
            ...    # the loop body is cut off in the source

How to use parser

from xml.sax import make_parser
from xml.sax.handler import feature_namespaces
from xml.sax import saxutils

parser = make_parser()
parser.setFeature(feature_namespaces, 0)
dh = SaxReader()
parser.setContentHandler(dh)
parser.parse(fh)    # fh: an already opened file with the XML document
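In Python 3 the standard library offers xml.sax.handler.ContentHandler as the base class with these callback methods. The sketch below uses it in place of saxutils.DefaultHandler, which is an assumption about how the lecture code carries over; it records events in lists instead of printing them, and the shortened XML sample is made up from the example document.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class SaxReader(ContentHandler):
    """Record element starts and text as the document is read."""

    def __init__(self):
        super().__init__()
        self.started = []   # (element name, attributes) pairs
        self.text = []      # chunks of character data

    def startElement(self, name, attrs):
        self.started.append((name, dict(attrs.items())))

    def characters(self, content):
        # may be called several times for one piece of text
        self.text.append(content)

xml_text = (b'<biblioteka><ksiazka egzemplarze="3">'
            b'<tytul>Python cookbook</tytul></ksiazka></biblioteka>')

dh = SaxReader()
xml.sax.parseString(xml_text, dh)

print(dh.started)
print(''.join(dh.text))  # Python cookbook
```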

SAX: summary

Read-only mode processing;
processes parts of the document;
SAX is fast, with small memory requirements.

DOM: Document Object Model

A document is kept entirely as a tree;
A document (its tree) can be modified;
Processing needs time and memory, since the whole tree is kept in memory;
The DOM specification is maintained by the W3C.

Reminder

Example

<?xml version="1.0" encoding="UTF-8"?>
<biblioteka>
  <ksiazka egzemplarze="3">
    <autor>Ascher, Martelli, Ravenscroft</autor>
    <tytul>Python cookbook</tytul>
  </ksiazka>
  <ksiazka>
    <autor/>
    <tytul>Python for beginners</tytul>
  </ksiazka>
</biblioteka>

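A picture

[Figure: DOM tree of the example document. The Document node (<?xml version="1.0" encoding="UTF-8"?>) has the Element <biblioteka> as its child; below it are Element nodes interleaved with Text nodes; the Elements <autor> and <tytul> contain the Text nodes "Ascher, ..." and "Python for ...".]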

Python libraries

xml.dom: DOM Level 2
xml.dom.minidom: lightweight DOM implementation, DOM Level 1

minidom implementation

A class Node

class attribute    example
.nodeName          library, book, author
.nodeValue         "Python cookbook"
.attributes        <book copies="3">
.childNodes        list of subnodes

Tree creation

XML file processing

import xml.dom.minidom

# wezel ("node"): print the name of a node and of all its descendants
def wezel(node):
    print(node.nodeName)
    for n in node.childNodes:
        wezel(n)

doc = xml.dom.minidom.parse('content.xml')
wezel(doc)
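The same traversal can be tried without a file by parsing a string. This Python 3 sketch collects the node names in a list rather than printing them; the helper name names and the shortened XML sample are assumptions made for the illustration.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    '<biblioteka><ksiazka egzemplarze="3">'
    '<tytul>Python cookbook</tytul></ksiazka></biblioteka>')

def names(node):
    """Return nodeName of the node and of all its descendants, depth first."""
    result = [node.nodeName]
    for n in node.childNodes:
        result.extend(names(n))
    return result

print(names(doc))
# ['#document', 'biblioteka', 'ksiazka', 'tytul', '#text']

ksiazka = doc.getElementsByTagName('ksiazka')[0]
tytul = doc.getElementsByTagName('tytul')[0]
print(ksiazka.getAttribute('egzemplarze'))  # 3
print(tytul.firstChild.nodeValue)           # Python cookbook
```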

DOM processing

Node manipulation

appendChild(newChild)
removeChild(oldChild)
replaceChild(newChild, oldChild)

New node creation

new = document.createElement('chapter')
new.setAttribute('number', '5')
document.documentElement.appendChild(new)

print(document.toxml())
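A runnable Python 3 sketch of these operations on a small document. The <book> document is hypothetical, and toxml() is called on the root element so that the XML declaration is left out of the output.

```python
import xml.dom.minidom

# Hypothetical starting document
document = xml.dom.minidom.parseString('<book><chapter number="4"/></book>')
root = document.documentElement

# Create a new element and attach it to the root
new = document.createElement('chapter')
new.setAttribute('number', '5')
root.appendChild(new)
after_append = root.toxml()
print(after_append)
# <book><chapter number="4"/><chapter number="5"/></book>

# Remove the original first chapter
root.removeChild(root.firstChild)
after_remove = root.toxml()
print(after_remove)
# <book><chapter number="5"/></book>
```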


Summary: DOM

processes the entire tree
needs a lot of time and memory for large files
