A Taste of Python - Devdays Toronto 2009

A Taste of Python. Presented by Jordan Baker, October 23, 2009, DevDays Toronto.

description

Explores Peter Norvig's spell corrector written in Python as an example of the language's elegance and readability

Transcript of A Taste of Python - Devdays Toronto 2009

Page 1: A Taste of Python - Devdays Toronto 2009

A Taste of Python

Presented by Jordan Baker
October 23, 2009
DevDays Toronto

Page 2: A Taste of Python - Devdays Toronto 2009

About Me

• Open Source Developer

• Founder of Open Source Web Application and CMS service provider: Scryent - www.scryent.com

• Founder of Toronto Plone Users Group - www.torontoplone.ca

Page 3: A Taste of Python - Devdays Toronto 2009

Agenda

• About Python

• Show me your CODE

• A Spell Checker in 21 lines of code

• Why Python ROCKS

• Resources for further exploration

Page 4: A Taste of Python - Devdays Toronto 2009

About Python

http://www.flickr.com/photos/schoffer/196079076/

Page 5: A Taste of Python - Devdays Toronto 2009

About Python

• Gotta love a language named after Monty Python’s Flying Circus

• Used in more places than you might know

Page 6: A Taste of Python - Devdays Toronto 2009

Significant Whitespace

C-like

if (x == 2) {
    do_something();
}
do_something_else();

Python

if x == 2:
    do_something()
do_something_else()

Page 7: A Taste of Python - Devdays Toronto 2009

Significant Whitespace

• less code clutter

• eliminates many common syntax errors

• proper code layout

• use an indentation aware editor or IDE

• Get over it!

Page 8: A Taste of Python - Devdays Toronto 2009

Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

Python is Interactive

Page 9: A Taste of Python - Devdays Toronto 2009

FIZZ BUZZ

1
2
FIZZ
4
BUZZ
...
14
FIZZ BUZZ

Page 10: A Taste of Python - Devdays Toronto 2009

def fizzbuzz(n):
    # Start at 1: range(n + 1) would also print "Fizz Buzz" for 0.
    for i in range(1, n + 1):
        if not i % 3:
            print "Fizz",
        if not i % 5:
            print "Buzz",
        if i % 3 and i % 5:
            print i,
        print

fizzbuzz(50)

FIZZ BUZZ
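A variant that returns the lines instead of printing them can make the logic easier to check. This is only a sketch in Python 3 syntax (the slides use Python 2), and the name `fizzbuzz_lines` is made up for illustration.

```python
def fizzbuzz_lines(n):
    """Return the FizzBuzz lines for 1..n as a list of strings."""
    lines = []
    for i in range(1, n + 1):
        parts = []
        if i % 3 == 0:
            parts.append("Fizz")
        if i % 5 == 0:
            parts.append("Buzz")
        # Fall back to the number itself when neither rule applied.
        lines.append(" ".join(parts) or str(i))
    return lines

print("\n".join(fizzbuzz_lines(15)))
```

The `or` fallback relies on an empty string being falsy, the same truthiness trick the spell checker uses later.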

Page 12: A Taste of Python - Devdays Toronto 2009

class FizzBuzzWriter(object):
    def __init__(self, limit):
        self.limit = limit

    def run(self):
        for n in range(1, self.limit + 1):
            self.write_number(n)

    def write_number(self, n):
        if not n % 3:
            print "Fizz",
        if not n % 5:
            print "Buzz",
        if n % 3 and n % 5:
            print n,
        print

fizzbuzz = FizzBuzzWriter(50)
fizzbuzz.run()

FIZZ BUZZ (OO)

Page 13: A Taste of Python - Devdays Toronto 2009

A Spell Checker in 21 Lines of Code

• Written by Peter Norvig

• Duplicated in many languages

• Simple Spellchecking algorithm based on probability

• http://norvig.com/spell-correct.html

Page 14: A Taste of Python - Devdays Toronto 2009

The Approach

• Census by frequency

• Morph the word (werd)

• Insertions: waerd, wberd, werzd

• Deletions: wrd, wed, wer

• Transpositions: ewrd, wred, wedr

• Replacements: aerd, ward, wbrd, word, wzrd, werz

• Find the one with the highest frequency: were
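The steps above can be sketched end to end on a toy example. This is an illustration in Python 3 syntax, not the code from the talk: the miniature `corpus` is invented, `collections.Counter` stands in for the frequency census, and `morphs` is a hypothetical helper name.

```python
import collections

# Hypothetical miniature corpus; the talk trains on a large text file instead.
corpus = "we were there where the word was heard and were we were".split()
counts = collections.Counter(corpus)   # census by frequency

def morphs(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one insertion, deletion, transposition or replacement away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    out = set()
    for a, b in splits:
        out.update(a + c + b for c in alphabet)              # insertions
        if b:
            out.add(a + b[1:])                               # deletion
            out.update(a + c + b[1:] for c in alphabet)      # replacements
        if len(b) > 1:
            out.add(a + b[1] + b[0] + b[2:])                 # transposition
    return out

# Keep only the morphs that occur in the corpus, then pick the most frequent.
candidates = [w for w in morphs("werd") if w in counts]
best = max(candidates, key=counts.get)
print(sorted(candidates), best)
```

With this corpus both "were" and "word" are one edit away from "werd", and "were" wins because it occurs more often.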

Page 15: A Taste of Python - Devdays Toronto 2009

import re, collections

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(words):
    model = collections.defaultdict(int)
    for w in words:
        model[w] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
    replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts    = [a + c + b     for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words):
    return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

Norvig Spellchecker

Page 16: A Taste of Python - Devdays Toronto 2009

def words(text):
    return re.findall('[a-z]+', text.lower())

>>> words("The cat in the hat!")
['the', 'cat', 'in', 'the', 'hat']

Regular Expressions
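One consequence of the `[a-z]+` pattern is that anything that is not a lowercase letter acts as a separator, including apostrophes and digits. A small sketch in Python 3 syntax (the sample sentence is made up):

```python
import re

def words(text):
    # '[a-z]+' matches each maximal run of lowercase letters, so digits,
    # punctuation and whitespace all act as separators.
    return re.findall('[a-z]+', text.lower())

print(words("Don't panic: it's 2009!"))
```

Contractions come out as separate fragments ('don', 't'), which is a simplification the word census tolerates.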

Page 17: A Taste of Python - Devdays Toronto 2009

>>> d = {'cat':1}
>>> d
{'cat': 1}
>>> d['cat']
1

>>> d['cat'] += 1
>>> d
{'cat': 2}

>>> d['dog'] += 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'dog'

Dictionaries
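With a plain dictionary, the `KeyError` on a missing key can be avoided in a couple of ways. The sketch below (Python 3 syntax, illustrative variable names) shows two common ones before the talk moves on to `defaultdict`.

```python
counts = {'cat': 1}

# Option 1: dict.get supplies a default when the key is absent.
counts['dog'] = counts.get('dog', 0) + 1

# Option 2: catch the KeyError explicitly.
try:
    counts['bird'] += 1
except KeyError:
    counts['bird'] = 1

print(counts)
```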

Page 18: A Taste of Python - Devdays Toronto 2009

# Has a factory for missing keys
>>> d = collections.defaultdict(int)
>>> d['dog'] += 1
>>> d
{'dog': 1}

>>> int
<type 'int'>
>>> int()
0

def train(words):
    model = collections.defaultdict(int)
    for w in words:
        model[w] += 1
    return model

>>> train(words("The cat in the hat!"))
{'cat': 1, 'the': 2, 'hat': 1, 'in': 1}

defaultdict
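A detail worth knowing about `defaultdict`: reading a missing key with square brackets stores the default, while `in` and `.get()` do not. A short sketch in Python 3 syntax, with made-up sample words:

```python
import collections

model = collections.defaultdict(int)
for w in ["the", "cat", "the"]:
    model[w] += 1

# Reading a missing key with [] calls the factory AND stores the result...
missing = model["dog"]
# ...whereas `in` and .get() leave the dictionary untouched.
has_bird = "bird" in model
got_bird = model.get("bird")
print(missing, "dog" in model, has_bird, got_bird)
```

This is why the spell checker can test `w in NWORDS` and rank with `NWORDS.get` without growing the model.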

Page 19: A Taste of Python - Devdays Toronto 2009

>>> text = file('big.txt').read()
>>> NWORDS = train(words(text))
>>> NWORDS
{'nunnery': 3, 'presnya': 1, 'woods': 22, 'clotted': 1, 'spiders': 1,
'hanging': 42, 'disobeying': 2, 'scold': 3, 'originality': 6,
'grenadiers': 8, 'pigment': 16, 'appropriation': 6, 'strictest': 1,
'bringing': 48, 'revelers': 1, 'wooded': 8, 'wooden': 37,
'wednesday': 13, 'shows': 50, 'immunities': 3, 'guardsmen': 4,
'sooty': 1, 'inevitably': 32, 'clavicular': 9, 'sustaining': 5,
'consenting': 1, 'scraped': 21, 'errors': 16, 'semicircular': 1,
'cooking': 6, 'spiroch': 25, 'designing': 1, 'pawed': 1,
'succumb': 12, 'shocks': 1, 'crouch': 2, 'chins': 1, 'awistocwacy': 1,
'sunbeams': 1, 'perforations': 6, 'china': 43, 'affiliated': 4,
'chunk': 22, 'natured': 34, 'uplifting': 1, 'slaveholders': 2,
'climbed': 13, 'controversy': 33, 'natures': 2, 'climber': 1,
'lency': 2, 'joyousness': 1, 'reproaching': 3, 'insecurity': 1,
'abbreviations': 1, 'definiteness': 1, 'music': 56, 'therefore': 186,
'expeditionary': 3, 'primeval': 1, 'unpack': 1, 'circumstances': 107,
... (about 6500 more lines) ...

>>> NWORDS['the']
80030
>>> NWORDS['unusual']
32
>>> NWORDS['cephalopod']
0

Reading the File

Page 20: A Taste of Python - Devdays Toronto 2009

import re, collections

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(words):
    model = collections.defaultdict(int)
    for w in words:
        model[w] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

Training the Probability Model

Page 21: A Taste of Python - Devdays Toronto 2009

# These two are equivalent:

result = []
for v in iter:
    if cond:
        result.append(expr)

[ expr for v in iter if cond ]

# You can nest loops also:

result = []
for v1 in iter1:
    for v2 in iter2:
        if cond:
            result.append(expr)

[ expr for v1 in iter1 for v2 in iter2 if cond ]

 

List Comprehensions
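A concrete instance of the equivalence may help. The sketch below uses Python 3 syntax and made-up data to build the same list both ways, then shows a nested comprehension.

```python
numbers = range(10)

# Loop form
squares_loop = []
for n in numbers:
    if n % 2 == 0:
        squares_loop.append(n * n)

# Comprehension form: same expression, same loop, same condition
squares_comp = [n * n for n in numbers if n % 2 == 0]

# Nested loops read left to right, outermost loop first
pairs = [(a, b) for a in "ab" for b in "xy"]

print(squares_comp, pairs)
```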

Page 22: A Taste of Python - Devdays Toronto 2009

>>> word = "spam"
>>> word[:1]
's'
>>> word[1:]
'pam'

>>> (word[:1], word[1:])
('s', 'pam')

>>> range(len(word) + 1)
[0, 1, 2, 3, 4]

>>> [(word[:i], word[i:]) for i in range(len(word) + 1)]
[('', 'spam'), ('s', 'pam'), ('sp', 'am'), ('spa', 'm'), ('spam', '')]

String Slicing
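Two properties of slicing make the split list safe to work with: every (head, tail) pair joins back into the original word, and slicing past the end returns an empty string rather than raising. A minimal check in Python 3 syntax (variable names are illustrative):

```python
word = "spam"
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

# Every split concatenates back to the original word.
rejoined_ok = all(a + b == word for a, b in splits)

# Slicing past the end returns an empty string instead of raising,
# while indexing past the end raises IndexError.
tail = word[10:]
try:
    word[10]
    index_failed = False
except IndexError:
    index_failed = True

print(len(splits), rejoined_ok, repr(tail), index_failed)
```

Because `b[1:]` on an empty tail is just `''`, the edit comprehensions guard with `if b` to avoid producing the unchanged word.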

Page 23: A Taste of Python - Devdays Toronto 2009

>>> word = "spam"
>>> s = [(word[:i], word[i:]) for i in range(len(word) + 1)]

>>> deletes = [a + b[1:] for a, b in s if b]

>>> deletes
['pam', 'sam', 'spm', 'spa']

>>> a, b = ('s', 'pam')
>>> a
's'
>>> b
'pam'

>>> bool('pam')
True
>>> bool('')
False

Deletions

Page 24: A Taste of Python - Devdays Toronto 2009

For example: teh => the

>>> transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]

>>> transposes
['psam', 'sapm', 'spma']

Transpositions

Page 25: A Taste of Python - Devdays Toronto 2009

>>> alphabet = "abcdefghijklmnopqrstuvwxyz"

>>> replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
>>> replaces
['apam', 'bpam', ..., 'zpam', 'saam', ..., 'szam', ..., 'spaz']

Replacements

Page 26: A Taste of Python - Devdays Toronto 2009

>>> alphabet = "abcdefghijklmnopqrstuvwxyz"

>>> inserts = [a + c + b for a, b in s for c in alphabet]
>>> inserts
['aspam', ..., 'zspam', 'sapam', ..., 'szpam', 'spaam', ..., 'spamz']

Insertion

Page 27: A Taste of Python - Devdays Toronto 2009

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
    replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts = [a + c + b for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

>>> edits1("spam")
set(['sptm', 'skam', 'spzam', 'vspam', 'spamj', 'zpam', 'sbam',
'spham', 'snam', 'sjpam', 'spma', 'swam', 'spaem', 'tspam', 'spmm',
'slpam', 'upam', 'spaim', 'sppm', 'spnam', 'spem', 'sparm', 'spamr',
'lspam', 'sdpam', 'spams', 'spaml', 'spamm', 'spamn', 'spum',
'spamh', 'spami', 'spatm', 'spamk', 'spamd', ..., 'spcam', 'spamy'])

Find all Edits
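To get a feel for how many candidates `edits1` generates, the four lists can be counted separately. For a word of length n the comprehensions yield n deletions, n-1 transpositions, 26n replacements and 26(n+1) insertions, 54n+25 in total before the set removes duplicates. The sketch below (Python 3 syntax; `edit_lists` is a hypothetical helper that returns the lists without merging them) checks that arithmetic.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edit_lists(word):
    """Same four comprehensions as edits1, returned separately."""
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts    = [a + c + b for a, b in s for c in alphabet]
    return deletes, transposes, replaces, inserts

word = "spam"
n = len(word)
deletes, transposes, replaces, inserts = edit_lists(word)
sizes = (len(deletes), len(transposes), len(replaces), len(inserts))
total = sum(sizes)
# The set is smaller: replacing a letter with itself, for instance,
# reproduces the original word once per position.
unique = len(set(deletes + transposes + replaces + inserts))
print(sizes, total, unique)
```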

Page 28: A Taste of Python - Devdays Toronto 2009

def known(words):
    """ Return the known words from `words`. """
    return set(w for w in words if w in NWORDS)

Known Words

Page 29: A Taste of Python - Devdays Toronto 2009

def known(words):
    """ Return the known words from `words`. """
    return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=NWORDS.get)

>>> bool(set([]))
False

>>> correct("computr")
'computer'

>>> correct("computor")
'computer'

>>> correct("computerr")
'computer'

Correct
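Two idioms carry `correct`: an empty set is falsy, so `or` falls through to the next source of candidates, and `max(..., key=...)` ranks candidates by their count. The sketch below isolates both in Python 3 syntax. The `counts` dictionary and the hand-picked candidate list are assumptions standing in for `NWORDS` and `edits1`.

```python
# Hypothetical word counts standing in for NWORDS.
counts = {"computer": 12, "compute": 5}

def known(words):
    return set(w for w in words if w in counts)

# The misspelling itself is unknown, so the first known() is empty (falsy)
# and `or` moves on to the hand-picked one-edit candidates.
candidates = known(["computr"]) or known(["computer", "compute", "computq"]) or ["computr"]

# max(..., key=counts.get) ranks candidates by count, not alphabetically.
best = max(candidates, key=counts.get)

# When nothing is known, the chain ends at the original word.
fallback = known(["zzz"]) or ["zzz"]
print(sorted(candidates), best, fallback)
```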

Page 30: A Taste of Python - Devdays Toronto 2009

def known_edits2(word):
    return set(
        e2
            for e1 in edits1(word)
                for e2 in edits1(e1)
                    if e2 in NWORDS
        )

def correct(word):
    candidates = known([word]) or known(edits1(word)) or \
        known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

>>> correct("conpuler")
'computer'
>>> correct("cmpuler")
'computer'

Edit Distance 2
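To try the two-edit fallback without `big.txt`, the functions can be run against a tiny hand-made model. This is a sketch in Python 3 syntax; the two-word `NWORDS` with its counts is invented for illustration and is not the trained model from the talk.

```python
import collections

alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Hypothetical counts standing in for the model trained on big.txt.
NWORDS = collections.defaultdict(int, {'computer': 3, 'compiler': 2})

def edits1(word):
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts    = [a + c + b for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return set(w for w in words if w in NWORDS)

def known_edits2(word):
    # Apply edits1 twice, keeping only results that are real words.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def correct(word):
    candidates = (known([word]) or known(edits1(word))
                  or known_edits2(word) or [word])
    return max(candidates, key=NWORDS.get)

print(sorted(known_edits2('conpuler')))
print(correct('conpuler'))
```

Both "computer" and "compiler" are two replacements away from "conpuler"; the higher count decides between them.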

Page 32: A Taste of Python - Devdays Toronto 2009

Comparing Python & Java Versions

• http://raelcunha.com/spell-correct.php

• 35 lines of Java

Page 33: A Taste of Python - Devdays Toronto 2009

import java.io.*;
import java.util.*;
import java.util.regex.*;

class Spelling {

    private final HashMap<String, Integer> nWords = new HashMap<String, Integer>();

    public Spelling(String file) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(file));
        Pattern p = Pattern.compile("\\w+");
        for(String temp = ""; temp != null; temp = in.readLine()){
            Matcher m = p.matcher(temp.toLowerCase());
            while(m.find()) nWords.put((temp = m.group()), nWords.containsKey(temp) ? nWords.get(temp) + 1 : 1);
        }
        in.close();
    }

    private final ArrayList<String> edits(String word) {
        ArrayList<String> result = new ArrayList<String>();
        for(int i=0; i < word.length(); ++i) result.add(word.substring(0, i) + word.substring(i+1));
        for(int i=0; i < word.length()-1; ++i) result.add(word.substring(0, i) + word.substring(i+1, i+2) + word.substring(i, i+1) + word.substring(i+2));
        for(int i=0; i < word.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i+1));
        for(int i=0; i <= word.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i));
        return result;
    }

    public final String correct(String word) {
        if(nWords.containsKey(word)) return word;
        ArrayList<String> list = edits(word);
        HashMap<Integer, String> candidates = new HashMap<Integer, String>();
        for(String s : list) if(nWords.containsKey(s)) candidates.put(nWords.get(s),s);
        if(candidates.size() > 0) return candidates.get(Collections.max(candidates.keySet()));
        for(String s : list) for(String w : edits(s)) if(nWords.containsKey(w)) candidates.put(nWords.get(w),w);
        return candidates.size() > 0 ? candidates.get(Collections.max(candidates.keySet())) : word;
    }

    public static void main(String args[]) throws IOException {
        if(args.length > 0) System.out.println((new Spelling("big.txt")).correct(args[0]));
    }

}

Page 35: A Taste of Python - Devdays Toronto 2009

IDE for Python

• IDEs for Python include:

• PyDev for Eclipse

• WingIDE

• IDLE for Windows/Linux/Mac

• and more

Page 36: A Taste of Python - Devdays Toronto 2009

Why Python ROCKS

• Elegant and readable language - “Executable Pseudocode”

• Standard Libraries - “Batteries Included”

• Very high-level data types

• Dynamically Typed

• It’s FUN!

Page 37: A Taste of Python - Devdays Toronto 2009

An Open Source Community

• Projects: Plone, Zope, Grok, BFG, Django, SciPy & NumPy, Google App Engine, PyGame

• PyCon

Page 38: A Taste of Python - Devdays Toronto 2009

Resources

• PyGTA

• Toronto Plone Users

• Toronto Django Users

• Stack Overflow

• Dive into Python

• Python Tutorial

Page 39: A Taste of Python - Devdays Toronto 2009

Thanks

• I’d love to hear your questions or comments on this presentation. Reach me at:

[email protected]

• http://twitter.com/hexsprite