TEXT STATISTICS 7 DAY 30 - 11/05/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane...

Text statistics 7

Day 30 - 11/05/14

LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University

Course organization

03-Nov-2014 NLP, Prof. Howard, Tulane University

http://www.tulane.edu/~howard/LING3820/ The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/

Chapter numbering

3.7. How to deal with non-English characters

4.5. How to create a pattern with Unicode characters

6. Control

Final project

Open Spyder


Review


ConditionalFreqDist

>>> from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWord)
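As a rough illustration of what the object built above holds, the sketch below mimics a conditional frequency distribution with the standard library alone: one counter of samples per condition. The toy pairs and the helper name `conditional_counts` are assumptions made for this example; this is not NLTK's implementation.

```python
from collections import Counter, defaultdict

def conditional_counts(pairs):
    """Group (condition, sample) pairs into one Counter per condition."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table

# Toy (category, word) pairs standing in for the Brown corpus data.
toy_pairs = [
    ('news', 'the'), ('news', 'will'), ('news', 'the'),
    ('romance', 'could'), ('romance', 'the'), ('romance', 'could'),
]

toy_cfd = conditional_counts(toy_pairs)
print(sorted(toy_cfd))                    # the conditions
print(toy_cfd['news']['the'])             # count of 'the' under 'news'
print(toy_cfd['romance'].most_common(1))  # most frequent sample under 'romance'
```

Each condition maps to its own frequency table, which is why a sample can be looked up per category.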

Conditional frequency distribution


A more interesting example

           can  could  may  might  must  will
news        93     86   66     38    50   389
religion    82     59   78     12    54    71
hobbies    268     58  131     22    83   264
sci fi      16     49    4     12     8    16
romance     74    193   11     51    45    43
humor       16     30    8      8     9    13

Conditions = categories, samples = modal verbs

# from nltk.corpus import brown
# from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'religion', 'hobbies',
...        'science_fiction', 'romance', 'humor']
>>> mod = ['can', 'could', 'may', 'might', 'must', 'will']
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)
...            if w in mod]
>>> cfd = ConditionalFreqDist(catWord)
>>> cfd.tabulate()
>>> cfd.plot()

cfd.tabulate()

                  can  could  may  might  must  will
news               93     86   66     38    50   389
religion           82     59   78     12    54    71
hobbies           268     58  131     22    83   264
science_fiction    16     49    4     12     8    16
romance            74    193   11     51    45    43
humor              16     30    8      8     9    13
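Because the categories differ in size, raw counts are hard to compare across rows. A minimal sketch, using only the counts printed in the table above, converts each row to shares of that category's modal total. The dictionary `counts` and the helper `row_shares` are introduced here for illustration.

```python
modals = ['can', 'could', 'may', 'might', 'must', 'will']

# Counts copied from the cfd.tabulate() output above.
counts = {
    'news':            [93, 86, 66, 38, 50, 389],
    'religion':        [82, 59, 78, 12, 54, 71],
    'hobbies':         [268, 58, 131, 22, 83, 264],
    'science_fiction': [16, 49, 4, 12, 8, 16],
    'romance':         [74, 193, 11, 51, 45, 43],
    'humor':           [16, 30, 8, 8, 9, 13],
}

def row_shares(row):
    """Return each count divided by the row total."""
    total = sum(row)
    return [n / total for n in row]

for category, row in counts.items():
    shares = row_shares(row)
    top = modals[row.index(max(row))]  # modal with the highest count in this row
    print(f"{category:<16}", ' '.join(f"{s:.2f}" for s in shares), 'top:', top)
```

These shares describe the split among the six modals only, not the frequency of each modal relative to all the words in a category.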

cfd.plot()

Another example

The task is to find the frequency of 'America' and 'citizen' in NLTK's corpus of presidential inaugural addresses:

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ..., '2009-Obama.txt']
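Each file ID shown above starts with a four-digit year, so slicing off the first four characters yields the year of the address. The sketch below checks this on the four names visible in the output; the variable names are chosen for illustration.

```python
# File IDs copied from the listing above (the elided middle is omitted).
sample_ids = ['1789-Washington.txt', '1793-Washington.txt',
              '1797-Adams.txt', '2009-Obama.txt']

# The first four characters of each ID are the year.
years = [fileid[:4] for fileid in sample_ids]
print(years)
```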

cfd2.plot()

First try

from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if w.lower() in keys]
cfd2 = ConditionalFreqDist(keyYear)
cfd2.plot()
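To see what the first try counts and what it leaves out, the sketch below applies the same test, `w.lower() in keys`, to a handful of made-up tokens. The token list is an assumption for illustration, not text from the corpus.

```python
from collections import Counter

keys = ['america', 'citizen']

# Made-up tokens, not taken from the inaugural corpus.
tokens = ['America', 'Americans', 'american', 'citizen', 'Citizens', 'america']

# Same filter as the first try: exact match after lowercasing.
matched = [w for w in tokens if w.lower() in keys]
print(matched)

# The first try uses the original token as the condition,
# so differently capitalized forms are counted separately.
print(Counter(matched))
```

Inflected or derived forms such as 'Americans' and 'Citizens' fail the exact match, and 'America' and 'america' end up as separate conditions.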

cfd2.plot()

Second try

from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
keyYear = [(k, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           for k in keys
           if w.lower().startswith(k)]
cfd3 = ConditionalFreqDist(keyYear)
cfd3.plot()
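The prefix test in the second try can be checked on made-up tokens as well. In this sketch the condition is the matching key rather than the token itself, so all forms of a word are pooled; the token list is again an assumption for illustration.

```python
from collections import Counter

keys = ['america', 'citizen']

# Made-up tokens, not taken from the inaugural corpus.
tokens = ['America', 'Americans', 'american', 'citizen',
          'Citizens', 'citizenship', 'city']

# One entry per (token, key) pair where the lowercased token starts with the key.
hits = [k for w in tokens for k in keys if w.lower().startswith(k)]
print(Counter(hits))
```

Prefix matching also pools forms such as 'citizenship', which may or may not be wanted.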

cfd3.plot()

Stemming

Third try

from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if stemmer.stem(w) in keys]
cfd4 = ConditionalFreqDist(keyYear)
cfd4.plot()
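The third try depends on what the stemmer returns for each token. As a rough illustration that avoids NLTK entirely, the sketch below uses a hypothetical stand-in, `toy_stem`, which lowercases a token and strips one trailing 's'. It is not the Snowball algorithm, and real stemmer output may differ.

```python
from collections import Counter

keys = ['america', 'citizen']

def toy_stem(word):
    """Hypothetical stand-in for a stemmer: lowercase, drop one final 's'."""
    word = word.lower()
    return word[:-1] if word.endswith('s') else word

# Made-up tokens, not taken from the inaugural corpus.
tokens = ['America', 'Americans', 'citizen', 'Citizens', 'citizenship']

# Keep a token when its stem is one of the keys; count by stem.
stems = [toy_stem(w) for w in tokens if toy_stem(w) in keys]
print(Counter(stems))
```

Counting by stem, as here, pools 'citizen' and 'Citizens'; the code above uses the token `w` as the condition instead, which keeps them apart. With this stand-in, 'Americans' reduces to 'american', which is not in `keys`, so an exact match on stems can still miss related forms.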

cfd4.plot()

Twitter

Next time
