Lab 8: Sorting, Dictionary

Lab 8: File I/O, Mutability vs. Assignment

Ling 1330/2330: Intro to Computational Linguistics

Na-Rae Han

Objectives

1/31/2017 2

File I/O

Writing to a file

File I/O pitfalls

File reference and absolute path

Mutability vs. assignment

Homework #3 Review

Using textstats.py in IDLE shell

Shell tips and tricks

Tab completion

Finally, working with a file

How to write a script that reads in a text file and processes it? How to write out the result?

File IO (Input/Output)

Read in content of a file

Write out to a file

Write content into an existing file

Also: "pickling" a Python data object

A list, a dictionary, ... NEXT CLASS

Recap: Opening a text file for reading

    f = open('fox_in_sox.txt', 'r')
    ## read from f
    f.close()

f is a file object.

'fox_in_sox.txt' is the name of the file to read.

'r' is for reading. It is the default, so it can be omitted.

You may also need to specify the encoding: encoding='utf-8'

f.close() closes the file. ALWAYS REMEMBER TO CLOSE YOUR FILE.

Opening a text file for writing

    f = open('myresults.txt', 'w')
    ## write to f
    ## write more to f
    f.close()

'w' is for creating a new file for writing. If the file exists, it will be overwritten!!

Use 'a' instead for appending to an existing file.

You may also need to specify the encoding: encoding='utf-8'

f.close() closes the file. ALWAYS REMEMBER TO CLOSE YOUR FILE.

Just like reading, writing to a file involves the twin operations of creating and closing a file object.
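To see the difference between 'w' and 'a' in action, here is a minimal sketch. The scratch directory made with tempfile and the line contents are assumptions for illustration, chosen so that no real file gets overwritten.

```python
import os
import tempfile

# Work in a throwaway directory (an assumption) so nothing real is overwritten.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'myresults.txt')

f = open(path, 'w', encoding='utf-8')   # 'w' creates the file
f.write('first run\n')
f.close()

f = open(path, 'w', encoding='utf-8')   # 'w' again: the old content is wiped
f.write('second run\n')
f.close()

f = open(path, 'a', encoding='utf-8')   # 'a' adds to the end instead
f.write('appended line\n')
f.close()

f = open(path, encoding='utf-8')        # no mode given: 'r' is the default
content = f.read()
f.close()

print(content)
```

Only the second and third writes survive: 'first run' is gone because the file was reopened with 'w'.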

(1) Writing a single string

    f = open('foo.txt', 'w')
    f.write('Roses are red,\n')
    f.write('violets are blue.\n')
    f.close()

This creates the file foo.txt and writes a string to it, twice.

The 'w' option opens the file for writing.

f.write(str) writes a single string to file f.

foo.txt:

    Roses are red,
    violets are blue.

(2) Writing a list of strings

    rose = ['Roses are red,\n', 'violets are blue.\n']
    f = open('foo.txt', 'w')
    f.writelines(rose)
    f.close()

This creates the file foo.txt and writes two strings to it.

f.writelines(list) writes a list of strings to file f.

foo.txt:

    Roses are red,
    violets are blue.

.write() works one string at a time

✘ This fails, because .write() is given two strings:

    f = open('foo.txt', 'w')
    f.write('Roses are red,\n', 'violets are blue.\n')
    f.close()

    Traceback (most recent call last):
      File "F:\foo.py", line 3, in <module>
        f.write('Roses are red,\n', 'violets are blue.\n')
    TypeError: function takes exactly 1 argument (2 given)

To write a list of strings in one swoop, use .writelines() instead.
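A minimal sketch comparing the two approaches: one .writelines() call versus one .write() call per string. The scratch directory and the file names a.txt and b.txt are assumptions for illustration.

```python
import os
import tempfile

lines = ['Roses are red,\n', 'violets are blue.\n']
tmpdir = tempfile.mkdtemp()              # scratch directory (an assumption)

# One call to .writelines() ...
path_a = os.path.join(tmpdir, 'a.txt')
f = open(path_a, 'w')
f.writelines(lines)
f.close()

# ... writes the same content as one .write() call per string.
path_b = os.path.join(tmpdir, 'b.txt')
f = open(path_b, 'w')
for line in lines:
    f.write(line)
f.close()

# Read both files back to compare them.
f = open(path_a)
text_a = f.read()
f.close()

f = open(path_b)
text_b = f.read()
f.close()

print(text_a == text_b)
```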

Line breaks must be supplied

print() by default adds a line break.

.write() and .writelines() DO NOT add a line break.

Line breaks '\n' must be explicitly supplied.

    f = open('foo.txt', 'w')
    f.write('Roses are red,')
    f.write('violets are blue.')
    f.close()

Without \n, these will be written on a single line.

foo.txt:

    Roses are red,violets are blue.
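When your strings do not already end in '\n', there are two common ways to supply the line break. This is a sketch, not the only way: the list name verses and the scratch files are assumptions, and the second variant relies on the file= parameter of print().

```python
import os
import tempfile

verses = ['Roses are red,', 'violets are blue.']   # note: no '\n' at the end
tmpdir = tempfile.mkdtemp()                         # scratch directory (an assumption)

# Option 1: add '\n' yourself when calling .write().
path1 = os.path.join(tmpdir, 'foo1.txt')
f = open(path1, 'w')
for v in verses:
    f.write(v + '\n')
f.close()

# Option 2: let print() add the line break, sending its output to the file.
path2 = os.path.join(tmpdir, 'foo2.txt')
f = open(path2, 'w')
for v in verses:
    print(v, file=f)
f.close()

f = open(path1)
text1 = f.read()
f.close()

f = open(path2)
text2 = f.read()
f.close()

print(text1 == text2)
```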

Can only write string types

    f = open('foo.txt', 'w')
    pi = 3.14159265
    f.write('The value of pi is:\n')
    f.write(pi)
    f.close()

pi is float type. .write() can only take a string argument.

    Traceback (most recent call last):
      File "F:\foo.py", line 5, in <module>
        f.write(pi)
    TypeError: expected a character buffer object

Can only write string types

    f = open('foo.txt', 'w')
    pi = 3.14159265
    f.write('The value of pi is:\n')
    f.write(str(pi))
    f.close()

str() turns pi into a string.

.write() only takes a string as an argument.

Any other data types (integer, float, etc.) must first be converted into a string using the str() function.

foo.txt:

    The value of pi is:
    3.14159265
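A small sketch of building writable strings from a number. str() is the conversion shown above; the .format() line is an additional option, included here as an illustration, for when you want to control the number of decimal places.

```python
pi = 3.14159265

# str() converts the float into a string, which can then be concatenated.
as_text = str(pi)
line1 = 'The value of pi is: ' + as_text + '\n'

# str.format() converts and rounds in one step (2 decimal places here).
line2 = 'Rounded: {:.2f}\n'.format(pi)

# Both line1 and line2 are plain strings, so f.write(line1) would be accepted.
print(line1 + line2)
```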

Don't forget to close file

File writing happens through buffers.

Writing out to a file actually happens when your writing buffer is full or the file is closed.

So, if you forget to close your file, you might find your output file to be either empty or halfway written.

    f = open('foo.txt', 'w')
    f.write('Roses are red,\n')
    f.write('violets are blue.\n')

Forgot to specify f.close()! foo.txt may turn out empty or incomplete.

Always remember to CLOSE YOUR FILE.
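One way to make forgetting impossible is Python's with statement, which closes the file automatically when the indented block ends. This goes beyond what the slides cover, so treat it as an optional sketch; the scratch path is an assumption.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'foo.txt')   # scratch file (an assumption)

# The file is closed automatically at the end of the 'with' block,
# even if an error happens inside it.
with open(path, 'w') as f:
    f.write('Roses are red,\n')
    f.write('violets are blue.\n')

print(f.closed)   # the file object reports that it is closed

with open(path) as f:
    text = f.read()

print(text)
```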

Practice

Finish the script so it produces the output file. (3 minutes)

    chom = 'Colorless green ideas sleep furiously.'
    f = open('foo.txt', 'w')
    for wd in chom.split() :
        f.write(wd+'\n')
    f.close()

foo.txt:

    Colorless
    green
    ideas
    sleep
    furiously.

Practice

Modify the script so it produces the output file. (3 minutes)

    chom = 'Colorless green ideas sleep furiously.'
    f = open('foo.txt', 'w')
    for wd in chom.split() :
        f.write(wd+'\n')
    f.close()

foo.txt:

    Colorless is 9 characters long.
    green is 5 characters long.
    ideas is 5 characters long.
    sleep is 5 characters long.
    furiously. is 10 characters long.

Practice

Modify the script so it produces the output file.

    chom = 'Colorless green ideas sleep furiously.'
    f = open('foo.txt', 'w')
    for wd in chom.split() :
        f.write(wd+' is '+str(len(wd))+' characters long.\n')
    f.close()

foo.txt:

    Colorless is 9 characters long.
    green is 5 characters long.
    ideas is 5 characters long.
    sleep is 5 characters long.
    furiously. is 10 characters long.

Venturing out and into the jungle (aka your computer's hard drive)

So far, we have been dealing with files right INSIDE your Python's current WD.

We can reference them by their names only, "tale.txt", "fox_in_sox.txt", etc.

What about files that are elsewhere on your hard drive?

We have to understand and use file and directory path.

How to find absolute file path

Mac: Right-click, then Get Info

Windows: Right-click, then Properties

Absolute file path

A reference to a file can include a complete file path starting from the very top of the directory hierarchy: the absolute file path.

OS-X, Linux, Unix: /Users/narae/Documents/test.txt

Starts with the root '/'.

Windows: C:\Users\narae\Documents\test.txt

Starts with a drive letter: 'C:', 'D:', etc.

Uses backslash "\" instead of slash "/". PROBLEMATIC, because backslash has a special meaning in Python!!

Windows file path in Python

Windows uses backslash "\", which is a special character in Python.

As a result, Python gives you multiple ways to reference a Windows file path.

Use slash "/" instead, which Python internally converts to "\":

    file = 'C:/Users/narae/Desktop/test.txt'

Use backslash "\" but escape every instance:

    file = 'C:\\Users\\narae\\Desktop\\test.txt'

Use backslash "\" but, instead of escaping, use the r ('raw string') prefix:

    file = r'C:\Users\narae\Desktop\test.txt'

r'…' forces the following string '…' to be interpreted as literal characters.
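A quick sketch confirming that the three spellings name the same file, and showing why an unescaped backslash is risky. The ntpath module (the Windows flavor of os.path, importable on any system) is used here only to normalize the slash version for comparison; the variable names are assumptions.

```python
import ntpath   # Windows-style path functions, usable on any operating system

p_slash = 'C:/Users/narae/Desktop/test.txt'
p_escaped = 'C:\\Users\\narae\\Desktop\\test.txt'
p_raw = r'C:\Users\narae\Desktop\test.txt'

# Escaped and raw spellings produce the very same string.
print(p_escaped == p_raw)

# Normalizing the slash version to Windows style gives that string too.
print(ntpath.normpath(p_slash) == p_raw)

# Why unescaped backslashes are risky: '\t' silently becomes a TAB character.
risky = 'Desktop\test.txt'
print('\t' in risky)
print(repr(risky))
```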

Current working directory

Software applications typically operate on the notion of a current working directory (WD or CWD): the directory that the application is currently operating in.

Typically, "File Open" and "File Save" dialog windows will default to this directory.

With Python IDLE, the default WD is:

/Users/naraehan/Documents (Mac)

C:\Program Files (x86)\Python35-32 (Windows)

Not convenient at all.

We already changed these to our Python script directory!! (See the Windows instruction and the Mac instruction.)

Relative file path, starting from WD

A file's location can be specified relative to the working directory.

When a file reference does not start from the top, the starting point is assumed to be the WD.

File is located in WD:

    f = open('fox_in_sox.txt')

In a directory called 'data' inside WD:

    f = open('data/fox_in_sox.txt')

Located "one directory up" from WD (WD's parent directory):

    f = open('../fox_in_sox.txt')

.. is a short-hand for the parent directory.
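To check where a relative reference actually points, you can ask Python to expand it into an absolute path. A minimal sketch using os.path.abspath(); no file needs to exist for this to work, and the variable names are assumptions.

```python
import os

wd = os.getcwd()   # the current working directory

# abspath() expands a relative reference, using WD as the starting point.
in_wd = os.path.abspath('fox_in_sox.txt')
in_data = os.path.abspath('data/fox_in_sox.txt')
in_parent = os.path.abspath('../fox_in_sox.txt')

print('WD:         ', wd)
print('in WD:      ', in_wd)
print('in data/:   ', in_data)
print('one dir up: ', in_parent)
```

The printed paths depend on your own WD, so your output will differ from anyone else's.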

File location and path: WD

When a file is referred to by its name only, it is assumed to be in the current working directory (WD or CWD).

In scripts run from IDLE, WD is the directory where your script is.

foo.py:

    f = open('fox_in_sox.txt')
    # read file
    f.close()
    outf = open('results.txt', 'w')
    # write out to file
    outf.close()

The file to read must be in the same directory as the script foo.py.

The output file will be created in the directory where foo.py is.

WD in Python shell

In IDLE shell or in command-line, WD may be initially set to:

Your Python script folder, if you have customized your IDLE environment.

If you haven't:

/Users/username/Documents (Mac)

C:\Program Files (x86)\Python35-32 (Windows)

    >>> f = open('fox_in_sox.txt')
    Traceback (most recent call last):
      File "<pyshell#48>", line 1, in <module>
        f = open('fox_in_sox.txt')
    IOError: [Errno 2] No such file or directory: 'fox_in_sox.txt'

Error: The file is not in your shell's current working directory.

Discovering and changing your WD

    >>> import os
    >>> os.getcwd()
    'D:\\Lab'
    >>> os.chdir('C:/Users/narae/Documents')
    >>> os.getcwd()
    'C:\\Users\\narae\\Documents'

The os module must be imported first.

os.getcwd() displays the current WD.

os.chdir() changes the current WD. It is now set to C:\Users\narae\Documents (Windows).

'C:\\Users\\narae\\Documents' is the Windows-internal representation: directories are separated by '\'.

In OS X, directories look like /Users/naraehan/Documents
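A runnable sketch of the same idea that works on any system: remember the starting WD, move somewhere else, check whether a file is visible from there, then move back. The temporary directory stands in for a real folder and is an assumption.

```python
import os
import tempfile

start = os.getcwd()            # remember where we started
target = tempfile.mkdtemp()    # stand-in for some other folder (an assumption)

os.chdir(target)               # change the WD
now = os.getcwd()
print('WD is now:', now)

# A file referenced by name only is looked up in the current WD.
found = os.path.exists('fox_in_sox.txt')
print('fox_in_sox.txt visible from here?', found)

os.chdir(start)                # go back to the original WD
print('Back in:', os.getcwd())
```

Checking with os.path.exists() before calling open() is one way to avoid the "No such file or directory" error.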

File path & WD, a summary

Referencing a file with its absolute path always works:

'C:/Users/naraehan/Documents/fox_in_sox.txt' (Windows)

'/Users/naraehan/Documents/fox_in_sox.txt' (OS-X, Linux)

If referencing with shorthand, 'fox_in_sox.txt', the file has to be in the current WD (working directory).

In a SCRIPT: WD is where the script is, so the file and the script should be in the same directory.

In IDLE shell: your initial WD depends on your setting. Find out the WD and change it using the os module: os.getcwd(), os.chdir()

Beware: after running a script, your shell's WD changes to the script's location.

File path & WD, recommended practice

If you configured your Python IDLE, it conveniently launches with your own Python script directory as the initial WD.

Keep all your scripts, .txt input and output files in there

File I/O is relatively hassle free.

If you need to work with files somewhere else, be mindful of your WD and the absolute file path.

Copying a list content, 1st try

    >>> sim = ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie']
    >>> gran = ['Abe', 'Mona']

How to build a new list of all Simpson family members?

Copying a list content, 1st try

    >>> sim = ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie']
    >>> gran = ['Abe', 'Mona']
    >>> allsim = sim
    >>> allsim.extend(gran)
    >>> allsim
    ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie', 'Abe', 'Mona']
    >>> sim
    ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie', 'Abe', 'Mona']

Fail!! sim also changed.

What's going on??

Mutability strikes back

Copying a string:

    >>> x = 'hello'
    >>> x2 = x
    >>> x2 += '!!'
    >>> x2
    'hello!!'
    >>> x
    'hello'

"Copying" a list:

    >>> x = [1, 2, 3]
    >>> x2 = x
    >>> x2.append(4)
    >>> x2
    [1, 2, 3, 4]
    >>> x
    [1, 2, 3, 4]

What's going on?? We need to take a closer look at assignment vs. mutability.

Assignment: under the hood

Binding a variable in Python means setting a name to hold a reference to some object.

Assignment creates references, not copies.

Names in Python do not have an intrinsic type; objects do. Python determines the type of the reference automatically based on the data object assigned to it.

You create a name the first time it appears on the left side of an assignment statement:

    x = 3

An object is deleted via garbage collection once all names bound to it have passed out of scope.

Understanding reference semantics

Assignment manipulates references: it's all about pointing.

var2 = var1 does not make a copy of the object var1 references.

var2 = var1 makes var2 reference the object var1 references!

[Diagram: var1 and var2 both point to the same int object 3]

Reference vs. immutable types

So, for immutable datatypes (integers, floats, strings) assignment behaves as you would expect:

    >>> x = 'hi'      # Creates 'hi', name x refers to it
    >>> x2 = x        # Creates name x2, refers to 'hi'
    >>> x2 += '!!'    # Creates ref for 'hi!!', changes x2
    >>> x2
    'hi!!'
    >>> x             # No effect on x, still refers to 'hi'
    'hi'

[Diagram: x points to 'hi'; x2 points to a new object 'hi!!']

x and x2 now have different values.

Reference vs. mutable types

Mutable data types (lists, dictionaries) behave differently!

Method functions change these data in place, so…

    >>> x = [1,2,3]     # x references list [1,2,3]
    >>> x2 = x          # x2 now references what x references
    >>> x2.append(4)    # Changes the original list in memory
    >>> x2
    [1, 2, 3, 4]
    >>> x               # x too points to the changed list!
    [1, 2, 3, 4]

[Diagram: x and x2 both point to the same list object 1|2|3|4]

x and x2 refer to the same object in memory! When x is modified, x2 also changes.
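You can test whether two names point to the same object with the is operator. A short sketch covering a list, a dictionary, and a string for contrast; the variable names d, d2, s and s2 are assumptions.

```python
# List: both names refer to ONE object, so a change shows up under both names.
x = [1, 2, 3]
x2 = x
print(x2 is x)        # same object?
x2.append(4)
print(x)

# Dictionary: also mutable, so it behaves the same way.
d = {'a': 1}
d2 = d
d2['b'] = 2
print(d)

# String: immutable, so += builds a NEW object and rebinds s2 only.
s = 'hi'
s2 = s
s2 += '!!'
print(s, s2, s2 is s)
```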

Copying a list content, 1st try

    >>> sim = ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie']
    >>> gran = ['Abe', 'Mona']
    >>> allsim = sim
    >>> allsim.extend(gran)
    >>> allsim
    ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie', 'Abe', 'Mona']
    >>> sim
    ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie', 'Abe', 'Mona']

Fail!! sim also changed.

So! How do we create a *new* list object from an existing list?

Clone a list through [:]

    >>> sim = ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie']
    >>> gran = ['Abe', 'Mona']
    >>> allsim = sim[:]
    >>> allsim.extend(gran)
    >>> allsim
    ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie', 'Abe', 'Mona']
    >>> sim
    ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie']

SUCCESS!

[:] returns a whole slice of sim, as a *new list object*.

+ merges 2 lists into a new one

    >>> sim = ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie']
    >>> gran = ['Abe', 'Mona']
    >>> allsim = sim + gran
    >>> allsim
    ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie', 'Abe', 'Mona']
    >>> sim
    ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie']

SUCCESS!

+ merges two lists and returns the result as a new list.

Creating a *new* list obj from existing

[:] slicing

    >>> x = [1,2,3,4,5]
    >>> x[2:]
    [3, 4, 5]
    >>> x[:]
    [1, 2, 3, 4, 5]

The whole slice [:] clones the entire list and returns it as a new object.

+ operator

    >>> x = [1,2,3,4]
    >>> y = [5,6,7]
    >>> x + y
    [1, 2, 3, 4, 5, 6, 7]
    >>> x + [100, 200]
    [1, 2, 3, 4, 100, 200]

list1 + list2 returns a new list created by concatenating the contents.
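Besides [:] and +, Python offers a couple of other ways to get a new list object. The sketch below adds list() and the .copy() method, neither of which appears in the slides, and uses is to confirm that each clone is a separate object.

```python
sim = ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie']
gran = ['Abe', 'Mona']

by_slice = sim[:]        # whole slice, as in the slides
by_list = list(sim)      # list() builds a new list from any sequence
by_copy = sim.copy()     # .copy() method of list objects

# Each clone has equal content but is a different object from sim.
for clone in (by_slice, by_list, by_copy):
    print(clone == sim, clone is sim)

# Changing a clone leaves the original untouched.
by_slice.extend(gran)
print(by_slice)
print(sim)
```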

Try it out (2 minutes)

    >>> x = [1, 2, 3]
    >>> x2 = x
    >>> x2.append(4)
    >>> x2
    [1, 2, 3, 4]
    >>> x
    [1, 2, 3, 4]
    >>> x[1:]
    [2, 3, 4]
    >>> x[:]
    [1, 2, 3, 4]
    >>> x3 = x[:]
    >>> x3.append(100)
    >>> x3
    [1, 2, 3, 4, 100]
    >>> x
    [1, 2, 3, 4]

    >>> x = [1,2,3,4]
    >>> y = [5,6,7]
    >>> x + y
    [1, 2, 3, 4, 5, 6, 7]
    >>> x + [100, 200]
    [1, 2, 3, 4, 100, 200]

    >>> x = 'hello'
    >>> x2 = x
    >>> x2 += '!!'
    >>> x2
    'hello!!'
    >>> x
    'hello'

Congratulations!

You have now learned all of essential Python.

From this point on, we will focus on:

Applying the knowledge to real-world problems

How to process text for basic statistics (already started)

How to work with a corpus

How to search through text data

Learn a few additional tricks

List comprehension

Pickling

Regular expressions

It is time to REVIEW

We have learned a WHOLE LOT so far.

First of all, you must KNOW WHAT WE LEARNED.

This is a GOOD time to REVIEW and make sure you have a good command of Python basics.

Secondly, you must BE ABLE TO SYNTHESIZE from what you know.

As you review the slides and tutorials, you will find yourself nodding along.

But reading comprehension gets you only so far: you should write the code yourself!

Next level up, you should be able to apply your coding skills to NOVEL PROBLEMS

    wd = 'penguin'
    rev = ''
    for i in wd :
        rev = i + rev
        print(rev)
    print(rev)

A common misconception

Wow. So many commands. How am I going to memorize all these…!!!

TRUTH: Nobody, even seasoned programmers, codes solely from memory.

Rather than trying to commit everything to memory, you should:

have a good overall knowledge of programming structure and practices

be willing to explore

have your references handy – know how to look stuff up!

Be more efficient

Do you find coding tedious?

Make sure you are:

Fully utilizing the SHELL side for testing.

Remember the "Rookie vs. Pro ways" video?

Utilizing your command-line history.

Don't re-type your commands!

Change Ctrl+p/n to up arrow (↑) and down arrow (↓)

Using dir() and help().

Using TAB completion and tooltips.

Will show you in a moment.

Homework 3 review

Online resources

The power of functions and modular programming

Building up complex data through pipelining

n-gram functions

How to generalize?

Why insist on tuple format?

getTypes() revisited

.keys() not necessary after all!

There are now two ways of building a frequency dictionary.

Norvig's data: word lists

words.js, enable1.txt

What are they?

How big?

Norvig's data: 1- & 2-grams

count_1w.txt, count_2w.txt

Where do they come from?

COCA n-gram lists

2-grams:

    66 a ba
    41 a babble
    28 a babbling
    159 a babe
    83 a baboon
    9744 a baby
    31 a baby-faced
    122 a baby-sitter
    237 a babysitter
    23 a babysitting
    95 a baccalaureate
    71 a bach
    1342 a bachelor
    27 a bachelorette
    53 a bachelors
    1924 a back
    38 a back-and-forth
    24 a back-door
    29 a back-to-basics
    27 a back-to-school
    100 a back-up
    53 a backboard
    72 a backbone
    93 a backcountry
    60 a backdoor

3-grams:

    33 a ba in
    35 a babble of
    33 a babe in
    316 a baby and
    25 a baby as
    73 a baby at
    32 a baby before
    53 a baby bird
    57 a baby boomer
    34 a baby born
    146 a baby boy
    36 a baby brother
    29 a baby by
    34 a baby can
    45 a baby carriage
    45 a baby crying
    39 a baby doll
    47 a baby for
    41 a baby from
    224 a baby girl
    35 a baby grand
    30 a baby has
    350 a baby in
    27 a baby into
    216 a baby is

Anything you noticed?

What did you find?

Word bigram function

    def getWord2Grams(wds) :
        "Takes a tokenized word list, returns a bigram list"
        bigrams = []
        for i in range(len(wds)-1) :
            gram = tuple(wds[i:i+2])
            bigrams.append(gram)
        return bigrams

    >>> chomtoks
    ['colorless', 'green', 'ideas', 'sleep', 'furiously', '.']
    >>> getWord2Grams(chomtoks)
    [('colorless', 'green'), ('green', 'ideas'), ('ideas', 'sleep'),
     ('sleep', 'furiously'), ('furiously', '.')]

Why was it necessary to make the n-grams a tuple type? What is wrong with lists?

It's all about the pipeline

Eventually, we want to produce an n-gram frequency dictionary.

getWord2Grams returns:

    [('rose', 'is'), ('is', 'a'),
     ('a', 'rose'), ('rose', 'is'),
     ('is', 'a'), ('a', 'rose'),
     ('rose', '.')]

getFreq then returns:

    {('rose', 'is'): 2, ('a', 'rose'): 2,
     ('rose', '.'): 1, ('is', 'a'): 2}

Tuples (immutable!) can be a dictionary key.

Lists and other mutable types cannot.
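You can verify the tuple-versus-list restriction directly. In this sketch a tuple bigram is counted without trouble, while using a list as a key raises a TypeError; the names freq, gram and message are assumptions.

```python
freq = {}

# A tuple works as a dictionary key.
gram = ('rose', 'is')
freq[gram] = freq.get(gram, 0) + 1
print(freq)

# A list does not: Python refuses because lists are mutable (unhashable).
message = ''
try:
    freq[['rose', 'is']] = 1
except TypeError as err:
    message = str(err)

print('TypeError:', message)
```

The exact wording of the error message varies between Python versions, but it mentions that the list type is unhashable.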

Generalized n-gram function

    def getWordNGrams(wds, n) :
        "Takes a tokenized word list, returns an n-gram list"
        ngrams = []
        for i in range(len(wds)-n+1) :
            gram = tuple(wds[i:i+n])
            ngrams.append(gram)
        return ngrams

    >>> chomtoks
    ['colorless', 'green', 'ideas', 'sleep', 'furiously', '.']
    >>> getWordNGrams(chomtoks, 4)
    [('colorless', 'green', 'ideas', 'sleep'),
     ('green', 'ideas', 'sleep', 'furiously'),
     ('ideas', 'sleep', 'furiously', '.')]

Building various data objects

Starting text: 'Rose is a rose is a rose is a rose.'

getTokens returns:

    ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']

getTypeFreq returns:

    {'a': 3, 'is': 3, '.': 1, 'rose': 4}

getTypes returns:

    ['.', 'a', 'is', 'rose']

getXFreqWords, with x = 3, returns:

    ['a', 'is', 'rose']

getXLengthWords, with x = 2, returns:

    ['is', 'rose']

Can we build a type frequency dictionary from a token list?

More than one way to build

Starting text: 'Rose is a rose is a rose is a rose.'

getTypeFreq takes the text and returns {'a': 3, 'is': 3, '.': 1, 'rose': 4}.

Can the same dictionary be built from the token list that getTokens returns? Yep! Pass it through getFreq(), a general-purpose frequency counter function.
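The homework's getFreq() is not shown in these slides, so the version below is only a guess at its general shape: a hypothetical sketch that counts any list of hashable items. The same function then serves both word tokens and bigram tuples.

```python
def getFreq(items):
    "Hypothetical sketch: takes a list of items, returns a frequency dictionary"
    freq = {}
    for item in items:
        freq[item] = freq.get(item, 0) + 1   # start at 0 if not seen before
    return freq

def getWordNGrams(wds, n):
    "Takes a tokenized word list, returns an n-gram list"
    ngrams = []
    for i in range(len(wds) - n + 1):
        ngrams.append(tuple(wds[i:i+n]))
    return ngrams

toks = ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']

type_freq = getFreq(toks)                        # counts word types
bigram_freq = getFreq(getWordNGrams(toks, 2))    # counts bigram tuples

print(type_freq)
print(bigram_freq)
```

Because the counter never looks inside the items, it works for anything that can be a dictionary key.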

Simplifying functions

    def getTypes(txt) :
        """Takes a piece of text (a single string), returns an
        alphabetically sorted list of unique word types.
        """
        tfreq = getTypeFreq(txt)
        return sorted(tfreq.keys())    # same as: sorted(tfreq)

sorted() can take a list, string, … and returns a list. When taking a dictionary, it returns a sorted list of dictionary keys.
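A quick check of this behavior, using the frequency dictionary from the earlier example. The variable names are assumptions.

```python
tfreq = {'a': 3, 'is': 3, '.': 1, 'rose': 4}

from_dict = sorted(tfreq)           # a dictionary: sorts its keys
from_keys = sorted(tfreq.keys())    # same result, with more typing
from_str = sorted('rose')           # a string: sorts its characters

print(from_dict)
print(from_keys)
print(from_str)
```

In every case the result is a new list; the dictionary or string passed in is left unchanged.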

Wrapping up

Next class:

Pickling

Handling multiple text files, large text files

Exercise 5

http://www.pitt.edu/~naraehan/ling1330/ex5.html

File I/O with O. Henry

Essentially the same as HW3, but writing results out to a file

HW3 solution will be posted on CourseWeb, as an attachment to the original submission link