Transcript of Introduction & Information Theory, Ling570 Advanced Statistical Methods in NLP, January 3, 2012.

Page 1:

Introduction & Information Theory

Ling570: Advanced Statistical Methods in NLP

January 3, 2012

Page 2: Roadmap

Course overview

Information theory

Page 3:

Course Overview

Page 6: Course Information

Course web page: http://courses.washington.edu/ling572

Syllabus: schedule and readings; links to other readings, slides, and class recordings. Slides are posted before class, but may be revised.

Catalyst tools: GoPost discussion board for class issues; CollectIt dropbox for homework submission and TA comments; Gradebook for viewing all grades.

Page 9: GoPost Discussion Board

Main venue for course-related questions and discussion.

What not to post: personal, confidential questions; homework solutions.

What to post: almost anything else course-related. "Can someone explain…?" "Is this really supposed to take this long to run?"

Key location for class participation: post questions or answers. This is your discussion space: Michael & I will not jump in often.

Page 11: GoPost

Emily's 5-minute rule: if you've been stuck on a problem for more than 5 minutes, post to the GoPost!

Mechanics:
Please use your UW NetID as your user id.
Please post early and often! Don't wait until the last minute.
Keep up with the GoPost; it is hard to use retrospectively.
Notifications: decide how you want to receive GoPost postings.

Page 13: Email

Should be used only for personal or confidential issues: grading issues, extended absences, other problems. General questions/comments go on GoPost.

Please send email from your UW account and include Ling572 in the subject. If you don't receive a reply in 24 hours (48 on weekends), please follow up.

Page 14: Homework Submission

All homework should be submitted through CollectIt:

tar cvf hw1.tar hw1_dir

Homework is due 11:45 Thursdays.

Late homework receives a 10%/day penalty (incremental).

Most major programming languages are accepted: C/C++/C#, Java, Python, Perl, Ruby. If you want to use something else, please check first.

Please follow the naming and organization guidelines in each homework.

All programming assignments should run on the CL cluster under Condor.
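As a sanity check before submitting, the packaging step can be run end to end (a sketch; `hw1_dir` and its contents here are placeholders, not actual assignment files):

```shell
# Hypothetical layout: hw1_dir/ holds your code and a readme.
mkdir -p hw1_dir
echo 'placeholder' > hw1_dir/readme.txt

tar cvf hw1.tar hw1_dir   # package the directory for CollectIt
tar tf hw1.tar            # list the archive contents to double-check
```

Listing the archive with `tar tf` before uploading catches the common mistake of tarring the wrong directory.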

Page 15: Homework Assignments

(Mostly) implementation tasks designed to give hands-on understanding of ML approaches.

Focus on core concepts, not minute optimizations: if the gold standard achieves 90.7%, 89.8% is okay.

Not scored directly on efficiency, but if it's too slow, it is hard to debug, test, etc.

Not scored on optimal software design either: try to avoid hardcoding, but you don't need a complex design.

Page 16: Grading

Homework assignments: 80%
Reading assignments: 10%
Class participation: 10%

No midterm or final exams.

One homework assignment may be dropped.

Page 18: Grades

Grades are in the Catalyst Gradebook; TA feedback is returned through CollectIt.

Extensions: only for extreme circumstances (illness, family emergencies).

Incomplete: only if all work is completed up to the last two weeks (UW policy).

Page 21: Workload

CLMS courses carry a heavy workload; Ling572 is no exception.

Estimates (per week):
~3 hours: lecture
10-12 hours: homework assignments (highly variable, depending on prior programming experience)
1-3 hours: reading + reading assignments

Tracking: there is a GoPost thread for each assignment; please post your time. Consider an automatic time tracker (e.g. 'hamster' for Linux).

Page 24: Recordings

All classes will be recorded. Links to recordings appear in the syllabus and are available to all students, DL and in-class.

Please remind me to record the meeting (look for the red dot) and to repeat in-class questions.

Note: the instructor's screen is projected in class, so assume that the chat window is always public.

Page 26: Contact Info

Gina: Email: [email protected]
Office hour: Fridays 12:30-1:30 (after Treehouse meeting); Location: Padelford B-201; or by arrangement. Available by Skype or Adobe Connect.

TA: Michael Wayne Goodman: Email: [email protected]
Office hour: Time: TBD, see GoPost; Location: Treehouse

Page 29: Online Option

Please check that you are registered for the correct section:
CLMS in-class: Section A
State-funded: Section B
CLMS online: Section C

Online attendance for in-class students: not more than 2 times per term (e.g. missed bus, ice).

Please enter the meeting room 5-10 minutes before the start of class, and try to stay online throughout class.

Page 30: Online Tip

If you see: "You are not logged into Connect. The problem is one of the following: the permissions on the resource you are trying to access are incorrectly set (please contact your instructor/Meeting Host/etc.); you do not have a Connect account but need to have one; for UWEO students: if you have just created your UW NetID or just enrolled in a course….."

Clear your cache, then close and restart your browser.

Page 31:

Course Description

Page 33: Course Prerequisites

Programming languages: Java/C++/Python/Perl/…

Operating systems: basic Unix/Linux

CS 326 (Data Structures) or equivalent: lists, trees, queues, stacks, hash tables, …; sorting, searching, dynamic programming, …

Stat 391 (Probability and Statistics): random variables, conditional probability, Bayes' rule, …

Ling 570 (or similar)

If you haven't taken Ling570 or Ling472, please email me.

Page 35: Textbook

No textbook; online readings.

Reference / background:
Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition, 2008. Available from the UW Bookstore, Amazon, etc.
Manning and Schutze, Foundations of Statistical Natural Language Processing. An early edition is available online through the UW library.

Page 38: Course Goals

Understand the basis of machine learning algorithms that achieve state-of-the-art results.

Focus on classification and sequence labeling.

Concentrate on the basic concepts of machine learning techniques and their application to NLP tasks. This is not a computational learning theory class; we won't focus on proofs.

Page 40: Model Questions

Machine learning algorithms: decision trees and naïve Bayes, MaxEnt and support vector machines, ….

Key questions:
What is the model?
What assumptions does the model make?
How many parameters does the model have?

Page 42: Model Questions

Training: how are the parameters learned?

Decoding: how does the model assign values?

Pros and cons: How does the model handle outliers? Missing data? Noisy data? Is it scalable? How long does it take to train? To decode? How much training data is needed? Labeled? Unlabeled?

Page 44: Outline for Ling572

Unit #0 (0.5 weeks): Basics: introduction, information theory, classification review

Unit #1 (3 weeks): Classic machine learning: k nearest neighbors, decision trees, naïve Bayes, perceptrons (?)

Page 47: Outline for Ling572

Unit #3 (4 weeks): Discriminative classifiers: feature selection, maximum entropy models, support vector machines

Unit #4 (1.5 weeks): Sequence learning: conditional random fields, transformation-based learning

Unit #5 (1 week): Other topics: semi-supervised learning, …

Page 48: Outline for Ling572

Topics:
Feature selection approaches
Beam search
Using binary classifiers for multiclass classification

Toolkits: Mallet, libSVM

Page 52: Early NLP

Early approaches to natural language processing were similar to classic approaches to artificial intelligence:

Reasoning-based, knowledge-intensive approaches

Largely manually constructed rule-based systems

Typically focused on specific, narrow domains

Page 56: Early NLP: Issues

Rule-based systems:

Too narrow and brittle: couldn't handle new domains (out of domain -> crash)

Hard to maintain and extend: large manual rule bases incorporate complex interactions and don't scale

Slow

Page 59: Reports of the Death of NLP…

ALPAC Report, 1966 (Automatic Language Processing Advisory Committee): failed systems efforts, especially MT, led to defunding.

Example (probably apocryphal): English -> Russian -> English MT.
"The spirit is willing but the flesh is weak."
"The vodka is good but the meat is rotten."

Page 60:

…Were Greatly Exaggerated

Today:

Watson wins Jeopardy!

SIRI speaks and understands

Google searches and translates

Page 64: So What Happened?

Statistical approaches and machine learning:

Hidden Markov models boosted speech recognition

The noisy channel model gave statistical MT

Unsupervised topic modeling

Etc.

Page 68: So What Happened?

Many stochastic approaches were developed in the 80s-90s; the rise of machine learning accelerated from 2000 to the present. Why?

Large-scale data resources: web data; training corpora (Treebank, TimeML, Discourse Treebank); Wikipedia, etc.

Large-scale computing resources: processors, storage, memory: local and cloud

Improved learning algorithms: supervised, semi-supervised, unsupervised, structured, …

Page 69:

Information Theory

Page 70: Entropy

Can be used as a measure of:

Match of model to data

How predictive an n-gram model is of the next word

Comparison between two models

Difficulty of a speech recognition task

Page 73: Entropy

An information-theoretic measure: measures the information in a model (grammar).

Conceptually, a lower bound on the number of bits needed to encode.

Entropy H(X), where X is a random variable and p is its probability function:

H(X) = - Σ_{x ∈ X} p(x) log2 p(x)

E.g. 8 things: numbering them as a code => 3 bits per transmission. Alternatively, use a short code for high-probability items and longer codes for low-probability items; this can reduce the average number of bits.
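The entropy definition is easy to check numerically; here is a quick Python sketch (the `entropy` helper is ours, written for illustration, not course code):

```python
import math

def entropy(probs):
    """H(X) = -sum over x of p(x) * log2 p(x); terms with p(x) == 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries exactly 1 bit.
print(entropy([0.5, 0.5]))   # 1.0
# 8 equally likely outcomes need log2(8) = 3 bits.
print(entropy([1/8] * 8))    # 3.0
```

Skewed distributions give lower entropy, which is exactly why shorter codes for frequent items pay off.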

Page 79: Computing Entropy

Picking horses (Cover and Thomas).

Send a message identifying the winning horse, 1 of 8. If all horses are equally likely, p(i) = 1/8:

H(X) = - Σ_{i=1}^{8} (1/8) log2 (1/8) = 3 bits

Page 82: Computing Entropy

Picking horses (Cover and Thomas).

Send a message identifying the winning horse, 1 of 8. If all horses are equally likely, p(i) = 1/8:

H(X) = - Σ_{i=1}^{8} (1/8) log2 (1/8) = 3 bits

Some horses are more likely: 1: 1/2; 2: 1/4; 3: 1/8; 4: 1/16; 5,6,7,8: 1/64.

H(X) = - Σ_{i=1}^{8} p(i) log2 p(i) = 2 bits

Corresponding codes: 0, 10, 110, 1110, 111100, 111101, 111110, and 111111.
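The skewed horse-race numbers can be verified directly in Python (a sketch; the probabilities are the ones from the Cover and Thomas example above):

```python
import math

# Cover & Thomas horse race: skewed win probabilities.
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

H = -sum(p * math.log2(p) for p in probs)
print(H)  # 2.0 bits

# Optimal code lengths are -log2 p(i): 1, 2, 3, 4, 6, 6, 6, 6 bits,
# matching the codewords 0, 10, 110, 1110, 111100, ..., 111111.
lengths = [-math.log2(p) for p in probs]
avg_len = sum(p * l for p, l in zip(probs, lengths))
print(avg_len)  # 2.0: expected code length equals the entropy
```

With the uniform distribution the same computation gives 3 bits, so knowing the skew saves a full bit per message on average.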

Page 85: Entropy of a Sequence

Basic sequence: per-word entropy rate of sequences W_1^n drawn from language L:

(1/n) H(W_1^n) = -(1/n) Σ_{W_1^n ∈ L} p(W_1^n) log2 p(W_1^n)

Entropy of a language: take the limit over infinite lengths. Assume the language is stationary and ergodic; then by the Shannon-McMillan-Breiman theorem:

H(L) = lim_{n→∞} -(1/n) Σ_{W_1^n ∈ L} p(w_1,…,w_n) log2 p(w_1,…,w_n)
     = lim_{n→∞} -(1/n) log2 p(w_1,…,w_n)

Page 87: Entropy of English

Shannon's experiment: subjects guess strings of letters; count the guesses. The entropy of the guess sequence equals the entropy of the letter sequence: 1.3 bits (restricted text).

Build a stochastic model on text and compute: Brown et al. computed a trigram model on a varied corpus and computed the (per-character) entropy of the model: 1.75 bits.

Page 93: Cross-Entropy

Comparing models: the actual distribution p is unknown; use a simplified model m to estimate it.

A closer match will have lower cross-entropy.
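The slide's equations did not survive extraction; using the standard definition H(p, m) = -Σ_x p(x) log2 m(x), the "closer match, lower cross-entropy" claim can be illustrated in a few lines (the distributions are invented for illustration):

```python
import math

def cross_entropy(p, m):
    """H(p, m) = -sum over x of p(x) * log2 m(x): average bits used when data
    drawn from p is encoded with a code optimal for model m."""
    return -sum(px * math.log2(mx) for px, mx in zip(p, m) if px > 0)

p = [0.5, 0.25, 0.25]      # "true" distribution
good = [0.5, 0.25, 0.25]   # model that matches p exactly
bad = [1/3, 1/3, 1/3]      # uniform model

print(cross_entropy(p, good))  # 1.5 = H(p): the lower bound
print(cross_entropy(p, bad))   # ~1.585: worse match, higher cross-entropy
```

Cross-entropy is never below the true entropy H(p), and equals it only when m matches p.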

Page 97: Relative Entropy

Commonly known as Kullback-Leibler divergence.

Expresses the difference between probability distributions.

Not a proper distance metric: it is asymmetric, KL(p||q) != KL(q||p).
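The asymmetry is easy to see numerically (a sketch using the standard definition KL(p||q) = Σ_x p(x) log2(p(x)/q(x)); the two distributions are invented for illustration):

```python
import math

def kl(p, q):
    """KL(p||q) = sum over x of p(x) * log2(p(x)/q(x)).
    Assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.75, 0.25]
q = [0.5, 0.5]

print(kl(p, q))  # ~0.189
print(kl(q, p))  # ~0.208: asymmetric, so KL(p||q) != KL(q||p)
print(kl(p, p))  # 0.0: zero exactly when the distributions match
```

KL is always non-negative and vanishes only for identical distributions, which is why it still works as a (directed) measure of model mismatch.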

Page 101: Joint & Conditional Entropy

Joint entropy: H(X,Y) = - Σ_x Σ_y p(x,y) log2 p(x,y)

Conditional entropy: H(Y|X) = - Σ_x Σ_y p(x,y) log2 p(y|x) = H(X,Y) - H(X)
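A small numeric example makes the chain rule H(Y|X) = H(X,Y) - H(X) concrete (a sketch; the joint table is invented for illustration):

```python
import math

def H(probs):
    """Entropy of a collection of probabilities, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy joint distribution p(x, y).
joint = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.5, ("b", 1): 0.0}

H_xy = H(joint.values())  # joint entropy H(X,Y)

# Marginal p(x), then H(X).
p_x = {}
for (x, _), pr in joint.items():
    p_x[x] = p_x.get(x, 0.0) + pr
H_x = H(p_x.values())

# Chain rule: H(Y|X) = H(X,Y) - H(X).
H_y_given_x = H_xy - H_x

print(H_xy, H_x, H_y_given_x)  # 1.5 1.0 0.5
```

The 0.5-bit answer checks out directly: given x = "a", Y is uniform (1 bit); given x = "b", Y is deterministic (0 bits); each case has weight 1/2.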

Page 105: Perplexity and Entropy

Given that H(L) ≈ -(1/N) log2 P(w_1 … w_N), consider the perplexity equation:

PP(W) = P(W)^(-1/N) = 2^{-(1/N) log2 P(W)} = 2^{H(L,P)}

where H is the entropy of the language L.
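The identity PP(W) = 2^H can be confirmed on a toy sequence probability (a sketch; the numbers are invented for illustration):

```python
import math

# Toy "model": a 4-word sequence where the model assigns each word
# probability 1/8.
N = 4
P_W = (1 / 8) ** N

pp = P_W ** (-1 / N)            # PP(W) = P(W)^(-1/N)
H = -(1 / N) * math.log2(P_W)   # per-word entropy in bits

print(pp, 2 ** H)  # 8.0 8.0 -- perplexity is 2 to the entropy
```

Intuitively, a perplexity of 8 says the model is as uncertain as if it were choosing uniformly among 8 words at each position.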

Page 111: Mutual Information

A measure of the information in common between two distributions:

I(X;Y) = Σ_x Σ_y p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]

Symmetric: I(X;Y) = I(Y;X)

I(X;Y) = KL(p(x,y) || p(x)p(y))
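Mutual information is zero exactly when X and Y are independent and positive otherwise, which a small example shows directly (a sketch; the joint tables are invented for illustration):

```python
import math

def mi(joint, p_x, p_y):
    """I(X;Y) = sum over (x,y) of p(x,y) * log2( p(x,y) / (p(x)*p(y)) )."""
    return sum(pxy * math.log2(pxy / (p_x[x] * p_y[y]))
               for (x, y), pxy in joint.items() if pxy > 0)

p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.5, 1: 0.5}

# Correlated variables share information.
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
# Independent variables share none: p(x,y) = p(x)p(y) everywhere.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

print(mi(dependent, p_x, p_y))    # ~0.278 bits
print(mi(independent, p_x, p_y))  # 0.0
```

The same sum is the KL divergence between the joint p(x,y) and the product of marginals p(x)p(y), matching the identity on the slide, and swapping the roles of X and Y leaves it unchanged.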

Page 112: Next Time

A little review

Decision trees

Applications of entropy