Hacker 102 - regexes w/Javascript, Python

hacker 102 code4lib 2010 preconference Asheville, NC, USA 2010-02-21

description

Basic introduction to regexes using JavaScript and Python. Developed for the code4lib 2010 preconference "Hacker 101/102".

Transcript of Hacker 102 - regexes w/Javascript, Python

Page 1: Hacker 102 - regexes w/Javascript, Python

hacker 102

code4lib 2010 preconference

Asheville, NC, USA 2010-02-21

Page 2: Hacker 102 - regexes w/Javascript, Python

iv. regular expressions

JavaScript

Page 3: Hacker 102 - regexes w/Javascript, Python

if all language looked like “aabaaaabbbabaababa”, it’d be easy to parse

Page 4: Hacker 102 - regexes w/Javascript, Python

parsing “aabaaaabbbabaababa”

• there are two elements, “a” and “b”

• either may occur in any order

• /([ab]+)/
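A minimal sketch of the pattern above at work. The slide writes it as a JavaScript regex literal; Python's `re` module is used here only so the example matches the starter code later in the deck. The second test string, `'aabcab'`, is made up for illustration.

```python
import re

# The slide's /([ab]+)/ written as a Python raw string.
pattern = re.compile(r'([ab]+)')

text = 'aabaaaabbbabaababa'

# fullmatch() succeeds only if the whole string is made of 'a's and 'b's.
whole = pattern.fullmatch(text)
print(whole.group(1))

# A character outside the class ends the run: match() stops before the 'c'.
partial = pattern.match('aabcab')
print(partial.group(1))

# The same string fails fullmatch() because of the 'c'.
print(pattern.fullmatch('aabcab'))
```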

Page 5: Hacker 102 - regexes w/Javascript, Python

• [] denotes a set of “elements”, also called a character “class”

• // demarcates regex

• + denotes “one or more of previous thing”

• () denotes “remember this matched group”

• /[ab]/ # an ‘a’ or a ‘b’

• /[ab]+/ # one or more ‘a’s or ‘b’s

• /([ab]+)/ # a group of one or more ‘a’s or ‘b’s
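The three steps on this slide can be sketched one at a time. This is an illustration in Python's `re` module rather than the JavaScript console, and the test string `'xxabbaxx'` is an assumption chosen to make the differences visible.

```python
import re

text = 'xxabbaxx'

# [ab]    : exactly one character, either 'a' or 'b'
one = re.search(r'[ab]', text)
print(one.group(0))

# [ab]+   : a run of one or more such characters
run = re.search(r'[ab]+', text)
print(run.group(0))

# ([ab]+) : the same run, but the parentheses also remember it as group 1
grouped = re.search(r'x([ab]+)x', text)
print(grouped.group(0))   # everything the pattern matched
print(grouped.group(1))   # only the remembered group
```

`group(0)` is always the whole match; numbered groups start at 1 and follow the order of the opening parentheses.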

Page 6: Hacker 102 - regexes w/Javascript, Python

to firebug!

Page 7: Hacker 102 - regexes w/Javascript, Python

• [a-z] is any lowercase character from a to z

• [0-9] is any digit

• + is one or more of previous thing

• ? is zero or one of previous thing

• | is or, e.g. (a|b) is ‘a’ or ‘b’ (inside [ ], | is just a literal character)

• * is zero or more of previous thing

• . matches any character (except a newline, by default)
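A short sketch that exercises each item in this list, written with Python's `re` module. All of the test strings are made up for illustration.

```python
import re

# [a-z]+ and [0-9]+ : a run of lowercase letters, then a run of digits
letters_digits = re.match(r'([a-z]+)([0-9]+)', 'vol70').groups()
print(letters_digits)

# ? : the trailing 's' is optional
singular = bool(re.fullmatch(r'journals?', 'journal'))
plural = bool(re.fullmatch(r'journals?', 'journals'))
print(singular, plural)

# | : either alternative
pets = re.findall(r'cat|dog', 'cat, dog, bird')
print(pets)

# * : zero or more, so 'ab*' also matches a bare 'a'
runs = re.findall(r'ab*', 'a ab abbb')
print(runs)

# . : any single character (except a newline, by default)
dots = re.findall(r'b.d', 'bad bed b d')
print(dots)
```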

Page 8: Hacker 102 - regexes w/Javascript, Python

• [^a-z] is anything *but* [a-z]

• [a-zA-Z0-9] is any of a-z, A-Z, 0-9

• {5} matches exactly 5 of the preceding thing

• {2,} matches at least 2 of the preceding thing

• {2,6} matches from 2 to 6 of preceding thing

• [\d] is like [0-9] (any digit)

• [\S] is any non-whitespace
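The items above can be tried out as follows. This is a sketch in Python's `re` module. The test strings are illustrative, and `{2,3}` stands in for the slide's `{2,6}` so the upper bound shows up on a short string.

```python
import re

# [^a-z] : any character that is NOT a lowercase letter
not_lower = re.findall(r'[^a-z]', 'ab1 C')
print(not_lower)

# {4} : exactly four digits, e.g. a year
years = re.findall(r'[0-9]{4}', 'Vol. 70 (1984) - Vol. 94 (2008)')
print(years)

# {2,} : at least two; {2,3} : from two to three
at_least_two = re.findall(r'a{2,}', 'a aa aaaa')
two_to_three = re.findall(r'a{2,3}', 'a aa aaaa')
print(at_least_two, two_to_three)

# \d : a digit; \S : any non-whitespace character
digits = re.findall(r'\d+', 'Vol. 61 (2009)')
chunks = re.findall(r'\S+', 'Vol. 61 (2009)')
print(digits, chunks)
```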

Page 9: Hacker 102 - regexes w/Javascript, Python

try this

• visit any web page

• open firebug console

• title = window.document.title

• try regexes to match parts of the title
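Outside a browser, the same exercise can be sketched in Python. The `title` string below is a hypothetical stand-in for whatever `window.document.title` returns on the page you visit.

```python
import re

# Hypothetical page title; in the browser this would come from
# window.document.title.
title = 'code4lib 2010 | Asheville, NC'

# Pull out the four-digit year.
year = re.search(r'[0-9]{4}', title).group(0)
print(year)

# Split the title into the parts on either side of the ' | ' separator.
# The bar is escaped because a bare | means "or".
parts = re.match(r'(.+) \| (.+)', title).groups()
print(parts)

# Find every word that starts with a capital followed by lowercase letters.
capitalised = re.findall(r'[A-Z][a-z]+', title)
print(capitalised)
```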

Page 10: Hacker 102 - regexes w/Javascript, Python

most every language has regex support

Page 11: Hacker 102 - regexes w/Javascript, Python

try unix “grep”
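For a feel of what grep does, here is a rough Python approximation: keep only the lines that contain a match. The `grep` function name and the sample lines are assumptions for illustration, and Python's regex syntax is not identical to grep's.

```python
import re

def grep(pattern, lines):
    """Return the lines that contain a match for pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Illustrative input lines.
sample = [
    'TITLE: ABA journal.',
    'CURRENT VOL.: Vol. 95 (2009) -',
    '(Bound and on Hein)',
]

# Lines that start with TITLE.
print(grep(r'^TITLE', sample))

# Lines that contain a four-digit year in parentheses.
print(grep(r'\(\d{4}\)', sample))
```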

Page 12: Hacker 102 - regexes w/Javascript, Python

v. glue it together

Python

Page 13: Hacker 102 - regexes w/Javascript, Python

problem: Carol’s data

Page 14: Hacker 102 - regexes w/Javascript, Python

TITLE: ABA journal. BD. HOLDINGS: Vol. 70 (1984) - Vol. 94 (2008)
CURRENT VOL.: Vol. 95 (2009) -
OTHER LIBRARIES: Miami:v. 68 (1982) - USDC: v. 88 (2002) - Birm.:v. 89 (2003) -
(Formerly: American Bar Association Journal)
(Bound and on Hein)

TITLE: Administrative law review. BD. HOLDINGS: Vol. 22 (1969/1970) - Vol. 60 (2008)
CURRENT VOL.: Vol. 61 (2009) - (Bound and on Hein)

Page 15: Hacker 102 - regexes w/Javascript, Python

starter code for you

Page 16: Hacker 102 - regexes w/Javascript, Python

#!/usr/bin/env python
import re

# a tag is a run of capitals, spaces and dots, followed by a colon
re_tag = re.compile(r'([A-Z \.]+):')
# a title is everything after "TITLE: "
re_title = re.compile(r'TITLE: (.*)')

with open('journals-carol-bean.txt') as f:
    for line in f:
        line = line.strip()
        if line == "":
            continue
        # match() only looks at the start of the line
        m1 = re_tag.match(line)
        m2 = re_title.match(line)
        print("\n->", line, "<-")
        if m1 or m2:
            print("MATCH")
        if m1:
            print('tag:', m1.groups())
        if m2:
            print('title:', m2.groups())
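One possible next step, sketched under stated assumptions: collect each tag and its value into a dictionary. The input file is not reproduced in the slides, so the sample below is adapted from the data slide, with one tag per line (a guess, since the slide's line breaks did not survive). `parse_record` and `re_tag_value` are hypothetical names, not part of the original starter code.

```python
import re

# Same tag pattern as the starter code, plus a second group that
# captures the text after the colon.
re_tag_value = re.compile(r'([A-Z \.]+):\s*(.*)')

# Sample adapted from the data slide; the line breaks are an assumption.
sample = """\
TITLE: Administrative law review.
BD. HOLDINGS: Vol. 22 (1969/1970) - Vol. 60 (2008)
CURRENT VOL.: Vol. 61 (2009) -
(Bound and on Hein)
"""

def parse_record(text):
    """Collect tag/value pairs; lines without a leading tag go under 'notes'."""
    record = {'notes': []}
    for line in text.splitlines():
        line = line.strip()
        if line == "":
            continue
        m = re_tag_value.match(line)
        if m:
            record[m.group(1)] = m.group(2)
        else:
            record['notes'].append(line)
    return record

record = parse_record(sample)
print(record['TITLE'])
print(record['CURRENT VOL.'])
print(record['notes'])
```

Lines such as "(Bound and on Hein)" start with a parenthesis, which is outside the tag's character class, so `match()` fails and they land in `notes`.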