Text statistics 7
Day 30 - 11/05/14
LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University

Course organization

03-Nov-2014NLP, Prof. Howard, Tulane University


http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control

Final project


Open Spyder


Review


ConditionalFreqDist

>>> from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWord)
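To make the data structure concrete, here is a minimal sketch of what a conditional frequency distribution stores: one counter of samples per condition. It uses only the standard library and a made-up list of (category, word) pairs, so it illustrates the idea and is not NLTK's actual implementation. The names `toy_pairs` and `toy_cfd` are assumptions for this example.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, sample) pairs, standing in for catWord.
toy_pairs = [
    ('news', 'the'), ('news', 'jury'), ('news', 'the'),
    ('romance', 'the'), ('romance', 'she'), ('romance', 'she'),
]

# One Counter per condition: this mimics the shape of a ConditionalFreqDist.
toy_cfd = defaultdict(Counter)
for condition, sample in toy_pairs:
    toy_cfd[condition][sample] += 1

print(sorted(toy_cfd))            # the conditions
print(toy_cfd['news']['the'])     # count of 'the' under 'news'
print(toy_cfd['romance']['she'])  # count of 'she' under 'romance'
```

Each pair contributes one count to the counter selected by its first element, which is why the order (condition, sample) in the list comprehension matters.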


Conditional frequency distribution


A more interesting example

          can  could  may  might  must  will
news       93     86   66     38    50   389
religion   82     59   78     12    54    71
hobbies   268     58  131     22    83   264
sci fi     16     49    4     12     8    16
romance    74    193   11     51    45    43
humor      16     30    8      8     9    13

Conditions = categories, sample = modal verbs

# from nltk.corpus import brown
# from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'religion', 'hobbies',
...        'science_fiction', 'romance', 'humor']
>>> mod = ['can', 'could', 'may', 'might', 'must', 'will']
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)
...            if w in mod]
>>> cfd = ConditionalFreqDist(catWord)
>>> cfd.tabulate()
>>> cfd.plot()
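One detail of the filter `if w in mod` is worth seeing in isolation: the test is case-sensitive, so a sentence-initial 'Will' or 'Can' is not counted. The sketch below uses a made-up token list (not Brown data) to compare the filter as written with a lowercased variant. Whether capitalized modals should be counted is a design choice, and the variant is only one option.

```python
mod = ['can', 'could', 'may', 'might', 'must', 'will']

# Hypothetical tokens for illustration only.
toy_words = ['Will', 'you', 'come', '?', 'I', 'will', 'if', 'I', 'can', '.']

# Filter as written on the slide: exact, case-sensitive match.
exact = [w for w in toy_words if w in mod]

# Variant: lowercase each token before testing membership.
lowered = [w.lower() for w in toy_words if w.lower() in mod]

print(exact)    # ['will', 'can']
print(lowered)  # ['will', 'will', 'can']
```

Lowercasing has a cost of its own: it would also count the month name 'May' as the modal 'may'.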


cfd.tabulate()

                 can  could  may  might  must  will
news              93     86   66     38    50   389
religion          82     59   78     12    54    71
hobbies          268     58  131     22    83   264
science_fiction   16     49    4     12     8    16
romance           74    193   11     51    45    43
humor             16     30    8      8     9    13
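Raw counts are hard to compare across categories because the categories differ in size. As a rough illustration, the sketch below copies two rows from the table above and converts each count to its share of the six modals within that category. The helper name `modal_shares` is hypothetical, and the normalization is only over these six modals, not over all words in the category.

```python
modals = ['can', 'could', 'may', 'might', 'must', 'will']

# Counts copied from the tabulated output above.
counts = {
    'news':    [93, 86, 66, 38, 50, 389],
    'romance': [74, 193, 11, 51, 45, 43],
}

def modal_shares(row):
    """Return each count as a fraction of the row total."""
    total = sum(row)
    return [n / total for n in row]

for category, row in counts.items():
    shares = modal_shares(row)
    top = modals[shares.index(max(shares))]
    print(category, top, round(max(shares), 3))
```

With these numbers, 'will' accounts for roughly 54% of the six modals in news, while 'could' is the most frequent of them in romance at roughly 46%.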


cfd.plot()


Another example

The task is to find the frequency of 'America' and 'citizen' in NLTK's corpus of presidential inaugural addresses:

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ..., '2009-Obama.txt']
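Each file name begins with the four-digit year of the address, so slicing off the first four characters gives a convenient label for grouping by year. A small sketch using only the file names shown above:

```python
# File names as listed by inaugural.fileids() above (abbreviated).
fileids = ['1789-Washington.txt', '1793-Washington.txt',
           '1797-Adams.txt', '2009-Obama.txt']

# The first four characters of each name are the year.
years = [title[:4] for title in fileids]
print(years)  # ['1789', '1793', '1797', '2009']
```

The years come out as strings, which is sufficient for labelling and sorts correctly because every year has four digits.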



First try

from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if w.lower() in keys]
cfd2 = ConditionalFreqDist(keyYear)
cfd2.plot()
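The first try keeps a token only when its lowercased form is exactly 'america' or 'citizen', and it uses the token itself as the condition. The sketch below runs the same filter over a made-up token list (not the inaugural corpus) to show two consequences: inflected forms such as 'Americans' or 'citizens' are skipped, and differently capitalized forms end up as separate conditions.

```python
from collections import Counter

keys = ['america', 'citizen']

# Hypothetical tokens for illustration only.
toy_words = ['America', 'Americans', 'citizen', 'Citizen', 'citizens']

# Same test as the first try: exact match on the lowercased token.
kept = [w for w in toy_words if w.lower() in keys]
print(kept)  # ['America', 'citizen', 'Citizen']

# Using the raw token as the condition splits 'citizen' and 'Citizen'.
print(Counter(kept))
```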


cfd2.plot()

Second try

from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
# Condition = the matching key, sample = the year of the address.
keyYear = [(k, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           for k in keys
           if w.lower().startswith(k)]
cfd3 = ConditionalFreqDist(keyYear)
cfd3.plot()
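The second try changes two things: it matches on a prefix with `startswith`, and it uses the key, not the token, as the condition. A minimal sketch over a made-up token list (not the inaugural corpus) shows the effect. The names `toy_words`, `pairs`, and `by_key` are assumptions for this example.

```python
from collections import Counter

keys = ['america', 'citizen']

# Hypothetical tokens for illustration only.
toy_words = ['America', 'Americans', 'citizen', 'Citizen', 'citizens']

# Pair each matching token with the key it starts with.
pairs = [(k, w) for w in toy_words for k in keys
         if w.lower().startswith(k)]

# Counting by key merges all spellings under two conditions.
by_key = Counter(k for k, _ in pairs)
print(by_key['america'], by_key['citizen'])  # 2 3
```

Prefix matching is deliberately loose: it would also accept any longer word that happens to begin with one of the keys.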


cfd3.plot()


Stemming


Third try

from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if stemmer.stem(w) in keys]
cfd4 = ConditionalFreqDist(keyYear)
cfd4.plot()
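To see what the stemmer contributes, it helps to apply it to a few word forms directly. The sketch below assumes NLTK is installed and uses a made-up list of forms. The Snowball stemmer lowercases its input and strips common inflectional endings, so a regular plural such as 'Citizens' reduces to 'citizen'. Stems are not always dictionary words, and related forms do not always share a stem, so it is worth checking the output for the words of interest before filtering on it.

```python
from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()

# Hypothetical word forms for illustration only.
forms = ['America', 'Americans', 'citizen', 'Citizens']

# Map each form to its stem.
stems = {w: stemmer.stem(w) for w in forms}
for form, stem in stems.items():
    print(form, '->', stem)
```

Only forms whose stem is exactly one of the keys pass the filter in the third try, so printing the stems first shows which forms will be counted.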


cfd4.plot()


Next time

Twitter
