Hacker 102 - regexes w/Javascript, Python

Post on 18-Dec-2014

Basic introduction to regexes using JavaScript and Python. Developed for code4lib 2010 conference preconf "Hacker 101/102".

hacker 102

code4lib 2010 preconference

Asheville, NC, USA 2010-02-21

iv. regular expressions

JavaScript

if all language looked like “aabaaaabbbabaababa” it’d be easy to parse

parsing “aabaaaabbbabaababa”

• there are two elements, “a” and “b”

• either may occur in any order

• /([ab]+)/

• [] denotes “elements” or “class”

• // demarcates regex

• + denotes “one or more of previous thing”

• () denotes “remember this matched group”

• /[ab]/ # an ‘a’ or a ‘b’

• /[ab]+/ # one or more ‘a’s or ‘b’s

• /([ab]+)/ # a group of one or more ‘a’s or ‘b’s
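The breakdown above can also be tried outside the browser. A minimal sketch in Python, assuming the standard `re` module; Python has no `/.../` literal, so the pattern goes in a raw string, and the second sample string is made up for illustration.

```python
import re

text = "aabaaaabbbabaababa"

# [ab]+ : one or more characters drawn from the class {a, b}
# ( )   : remember what matched as group 1
m = re.search(r"([ab]+)", text)
whole = m.group(1)  # the entire string, since it is all a's and b's

# With other characters mixed in, the group stops at the first non-a/b
m2 = re.search(r"([ab]+)", "xxabbay")
part = m2.group(1)

print(whole)
print(part)
```

The second search skips the leading `xx`, remembers `abba`, and stops at the `y`.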

to firebug!

• [a-z] is any lowercase character from a to z

• [0-9] is any digit

• + is one or more of previous thing

• ? is zero or one of previous thing

• | is or, e.g. (a|b) is ‘a’ or ‘b’ (inside [] a | is just a literal ‘|’)

• * is zero to many of previous thing

• . matches any character (except a newline, by default)

• [^a-z] is anything *but* [a-z]

• [a-zA-Z0-9] is any of a-z, A-Z, 0-9

• {5} matches exactly 5 of the preceding thing

• {2,} matches at least 2 of the preceding thing

• {2,6} matches from 2 to 6 of preceding thing

• [\d] is like [0-9] (any digit)

• [\S] is any non-whitespace
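A few of the items above, sketched with Python's `re` module rather than the Firebug console (an assumption for this example). Apart from the date line, the sample strings are made up for illustration.

```python
import re

# [0-9] with {4}: four digits in a row
year = re.search(r"[0-9]{4}", "Asheville, NC, USA 2010-02-21").group(0)

# ? makes the previous thing optional: matches "color" and "colour"
spellings = re.findall(r"colou?r", "color colour")

# | is alternation inside a group, not inside []
pets = re.findall(r"(cat|dog)", "cat, dog, bird")

# [^a-z] is any single character that is not a lowercase letter
non_lower = re.findall(r"[^a-z]", "ab1 c!")

# {2,} is at least two of the previous thing
runs = re.findall(r"o{2,}", "no noo nooo")

# \S is any non-whitespace character; + repeats it
words = re.findall(r"\S+", "to firebug!")

print(year, spellings, pets, non_lower, runs, words)
```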

• visit any web page

• open firebug console

• title = window.document.title

• try regexes to match parts of the title
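The same exercise can be sketched away from the browser. Here it is in Python, with a hard-coded string standing in for `window.document.title`; the title text is a made-up example, not a real page.

```python
import re

# Hypothetical stand-in for window.document.title
title = "code4lib 2010 - Asheville, NC"

# four digits in a row (a plain [0-9]+ would stop at the "4" in "code4lib")
year = re.search(r"[0-9]{4}", title).group(0)

# the first run of lowercase letters and digits, remembered as group 1
site = re.search(r"([a-z0-9]+)", title).group(1)

# capitalized words: one uppercase letter, then one or more lowercase
caps = re.findall(r"[A-Z][a-z]+", title)

print(year, site, caps)
```

Note that `NC` is not in `caps`, because neither capital is followed by a lowercase letter.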

try this

most every language has regex support

try unix “grep”

v. glue it together

Python

problem: Carol’s data

TITLE: ABA journal. BD. HOLDINGS: Vol. 70 (1984) - Vol. 94 (2008)
CURRENT VOL.: Vol. 95 (2009) -
OTHER LIBRARIES: Miami:v. 68 (1982) - USDC: v. 88 (2002) - Birm.:v. 89 (2003) -
(Formerly: American Bar Association Journal)
(Bound and on Hein)

TITLE: Administrative law review. BD. HOLDINGS: Vol. 22 (1969/1970) - Vol. 60 (2008)
CURRENT VOL.: Vol. 61 (2009) - (Bound and on Hein)

starter code for you

#!/usr/bin/env python
import re

# a tag is a run of capitals, spaces, and periods, followed by a colon
re_tag = re.compile(r'([A-Z \.]+):')
# a title line keeps everything after "TITLE: " as group 1
re_title = re.compile('TITLE: (.*)')

for line in open('journals-carol-bean.txt'):
    line = line.strip()
    m1 = re_tag.match(line)
    m2 = re_title.match(line)
    if line == "":
        continue
    print("\n->", line, "<-")
    if m1 or m2:
        print("MATCH")
    if m1:
        print('tag:', m1.groups())
    if m2:
        print('title:', m2.groups())
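One way the starter code might be extended to glue the pieces together is to collect each tagged line into a dictionary. This is a sketch, not the workshop's solution. It assumes each tag starts its own line (the full layout of `journals-carol-bean.txt` is not shown here), and `parse_record` is a hypothetical helper name.

```python
import re

# Same tag pattern as the starter code, plus a second group for the value
re_tag = re.compile(r'([A-Z \.]+):\s*(.*)')

def parse_record(lines):
    """Return a dict mapping each tag to its value (hypothetical helper)."""
    record = {}
    for line in lines:
        line = line.strip()
        m = re_tag.match(line)
        if m:
            tag, value = m.groups()
            record[tag.strip()] = value
    return record

# Sample lines adapted from Carol's data; the line breaks are an assumption
sample = [
    "TITLE: Administrative law review.",
    "BD. HOLDINGS: Vol. 22 (1969/1970) - Vol. 60 (2008)",
    "CURRENT VOL.: Vol. 61 (2009) - (Bound and on Hein)",
]
record = parse_record(sample)
for tag, value in record.items():
    print(tag, '=>', value)
```

Lines without a leading tag, such as `(Bound and on Hein)` on its own line, are silently skipped in this sketch; real data would likely need a rule for attaching them to the previous tag.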