Natural Language Processing and Communication Björn Gambäck Department of Computer and Information Science Norwegian University of Science and Technology Trondheim, Norway and SICS, Swedish Institute of Computer Science AB Stockholm, Sweden


Natural Language Processing and Communication

Björn Gambäck

Department of Computer and Information Science
Norwegian University of Science and Technology
Trondheim, Norway

and

SICS, Swedish Institute of Computer Science AB
Stockholm, Sweden


Two Main Reasons to Process Natural Languages

Allow computer agents to communicate with people (Ch 23)

Allow agents to acquire information from (written) language (Ch 22)


Languages are for Communication

A speaker must put words to his/her thoughts

A hearer must recognize the thoughts expressed from the words he/she perceives

Both presuppose the capacity to recognize systematic connections between meaning and linguistic form


Two Fundamental Traits of Natural Language Processing

Make it easier for people to communicate with computers

Make it easier for people to communicate with people


Two Fundamental Traits of Human Languages

Ambiguity: A word or a string of words has more than one meaning

Redundancy: The same information is expressed more than once


Forget About It!

FBI Technician: What’s “forget about it”?

Donnie Brasco: “Forget about it” is like if you agree with someone, you know, like “Raquel Welch is one great piece of ass, forget about it.”
But then, if you disagree, like “A Lincoln is better than a Cadillac? Forget about it!” you know?
But then, it’s also like if something’s the greatest thing in the world, like Mingio’s Peppers, “forget about it.”
But it’s also like saying “Go to hell!” too. Like, you know, like “Hey Paulie, you got a one inch pecker?” and Paulie says “Forget about it!”
Sometimes it just means forget about it.


Björn Gambäck? Who?!

• Swedish and US high schools (Nacka; Champaign, Illinois)
• MSc (civ. ing.) Computer Science & Engineering, KTH, Stockholm
• Linguistics, Computational Linguistics, Stockholm U
• PhD (tekn. dr.) Computer & System Sciences, KTH, Stockholm

• SICS, Swedish Institute of Computer Science AB, Stockholm, 1989-
• U Saarbrücken, Germany 1995-96
• Helsinki U, Finland 1997-99
• KTH, Stockholm, Sweden 1997-99
• Addis Ababa U, Ethiopia 2004
• NTNU, 2008- (Prof. Language Technology)

• proud father
• decent chess player
• lousy football referee...


Pensum / Literature

• These slides

Further Reading:
• Russell & Norvig 2010:
  – “Natural Language Processing” (Ch 22)
  – “Natural Language for Communication” (Ch 23)
• Gambäck 1999:
  – “Human Language Technology: The Babel Fish”
    www.idi.ntnu.no/~gamback/teaching/TDT4171/gamback_1999.pdf


’The Babel Fish’ said The Hitch Hiker’s Guide to the Galaxy quietly, ’is small, yellow and leech-like, and probably the oddest thing in the Universe. It feeds on brainwave energy received not from its own carrier but from those around it. It absorbs all unconscious and mental frequencies from this brainwave energy to nourish itself with. It then excretes into the mind of the carrier a telepathic matrix formed by combining the conscious thought frequencies with nerve signals picked up from the speech centres of the brain which has supplied them.

The practical upshot of all this is that if you stick a Babel fish in your ear you can instantly understand anything said to you in any form of language. The speech patterns you actually hear decode the brainwave matrix which has been fed into your mind by the Babel fish.’

(Douglas Adams: “The Hitch Hiker’s Guide to the Galaxy”)


The Parts of the Babel Fish

1. Recognize the spoken utterances: distinguish the words in the sound waves.

2. Extract the meaning of the sentences: identify the speaker’s intentions.

3. Translate the intended meaning into an utterance in another language.

4. Speak it out.


Speech Recognition Microphone

Large databases Pattern matching Transformations Search Language models


Speech Synthesis

Streams of connected words

Coarticulation Prosody (intonation) Large databases Canned utterances


Semantics

Syntax: how signs are related to each other

Semantics: how signs are related to things

Pragmatics: how signs are related to people

Mr. Smith is expressive


General NLP System Architecture

User Modeling Dialogue Management

Grammar


Analysis Depth


Analysis Width

morphemes: car-s

words: cars

phrases: see the cars

sentences: John doesn’t see the cars.

paragraphs: Three sports cars are speeding down the street. John doesn’t see the cars. He steps out into the street…

texts: Our story is about a short-sighted man named John. He lives in a small city with narrow streets. One day John goes for a walk.
Three sports cars are speeding down the street. John doesn’t see the cars. He steps out into the street…


The Research Frontier

Syntax

Compositional Semantics

Situational Semantics

Pragmatics

morphemes

words

phrases

sentences

paragraphs

texts


NLP Applications

What makes an application a language processing application (as opposed to any other piece of software)?

An application that requires the use of knowledge about human languages

Is Unix wc (word count) an example of a language processing application?


Applications: Word Count?

When it counts words: Yes
To count words you need to know what a word is. That’s knowledge of language.

When it counts lines and bytes: No
Lines and bytes are computer artifacts, not linguistic entities
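
A minimal Python sketch of the point, not the actual wc implementation: the function names and the sample string are assumptions chosen for illustration. Splitting on whitespace is one definition of "word"; the regular-expression variant is a second, equally defensible definition, and the two disagree on the same input.

```python
import re

def wc(text: str) -> tuple[int, int, int]:
    """Return (lines, words, bytes) in the spirit of Unix wc."""
    lines = text.count("\n")            # lines = number of newline characters
    words = len(text.split())           # word = maximal run of non-whitespace
    nbytes = len(text.encode("utf-8"))  # bytes depend on the encoding, not on the language
    return lines, words, nbytes

def count_words_regex(text: str) -> int:
    """Alternative definition: a word is a run of letters or digits."""
    return len(re.findall(r"[^\W_]+", text))

sample = "Paris-Dakar is 1,000 km\n"
print(wc(sample))                 # whitespace-based counts
print(count_words_regex(sample))  # hyphen and comma now split words
```

Whichever count is "right" depends on a linguistic decision about what a word is, which is exactly why counting words requires knowledge of language.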


Some NLP applications

(What kind of knowledge of language is needed?)
• Text-to-speech
• Speech Recognition
• OCR
• Information Retrieval
• Information Extraction
• Machine Translation
• Dialogue Systems
• ...


What is a language?

There are 6000-8000 languages in the World. (Why are the figures not more specific than that?)

There are 82 languages in Ethiopia. (How can we be sure of that? - Why not 80 or 85?)

How many languages are there in Norway?!

11! (according to the Ethnologue):
Norwegian: Bokmål, Nynorsk; Norwegian Sign Language
Finnish: Kven
Romani: Tavringer, Vlax; Norwegian Traveller
Saami: Lule, Pite, North, South


One or two individuals (languages)?

How can you tell if a person speaks the same language as yourself or if she speaks another, different language?

Do two speakers of the same language always speak alike? Is it always impossible to understand a person who speaks another language?


No, they don’t speak alike ...

several domains (farming, computer science, medicine, …)

dialects, sociolects, other sub-languages ...


No, sometimes you understand...

you know the language (from school / social contacts / your grandfather / …)

the languages are more or less similar: long contact, close relationship, strong influence


Intelligibility

It may be hard to understand a speaker of the same language. It may be easy to understand a speaker of a different language.

Mutual intelligibility and other linguistic factors are never enough if we are trying to establish whether a linguistic variety is a language or not.


Conclusion: An army and a navy ...

Political and cultural factors are often more important

“A language is a dialect with an army and a navy”


http://www.youtube.com/watch?v=12rNbGf2Wwo

Language Processing Systems

IBM's Watson system plays Jeopardy!


Analysis of Natural Languages

Syntax: actual structure of an utterance

Parsing: best possible way to make an analysis of an utterance

Semantics: representation of the meaning of an utterance (e.g., in a logical form)


Parsing Natural Languages

Highly ambiguous (in contrast to artificial languages)

Analysis problem more complex

The solutions are often based on different ways of saving already obtained partial parses


Grammatical vs. Meaningful Sentences

Belonging to the string set: * brown sleeps blue dog the

Grammatical (belonging to the language)? The blue brown blue brown blue dog sleeps

Understandable: The blue dog sleeps

Meaningful: The brown dog sleeps
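
The difference between the first two categories can be sketched in Python. The five-word lexicon and the toy rule S -> Det Adj* Noun Verb are assumptions made for this illustration, not a grammar taken from the slides. A sentence belongs to the string set if all its words are known; it is grammatical only if its tag sequence also matches the rule.

```python
import re

# Toy lexicon (assumption): word -> one-letter part-of-speech tag.
LEXICON = {"the": "D", "blue": "A", "brown": "A", "dog": "N", "sleeps": "V"}

# Toy grammar (assumption): S -> Det Adj* Noun Verb, as a regex over tags.
GRAMMAR = re.compile(r"DA*NV")

def in_string_set(sentence: str) -> bool:
    """True if every word is in the vocabulary, in any order."""
    return all(w in LEXICON for w in sentence.lower().split())

def is_grammatical(sentence: str) -> bool:
    """True if the words are known and their tag sequence matches the rule."""
    words = sentence.lower().split()
    if not all(w in LEXICON for w in words):
        return False
    tags = "".join(LEXICON[w] for w in words)
    return GRAMMAR.fullmatch(tags) is not None

for s in ["brown sleeps blue dog the",
          "The blue brown blue brown blue dog sleeps",
          "The blue dog sleeps"]:
    print(s, "|", in_string_set(s), is_grammatical(s))
```

Note that the grammar happily accepts the second sentence: nothing in the syntax rules out stacking contradictory adjectives. Whether a sentence is understandable or meaningful is a separate question that this sketch does not address.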


Grammar Coverage

Coverage is never complete Add more rules…

“All grammars leak” More specific rules Add more features


Syntactic Ambiguity

Joe said that Martha expected that it would rain yesterday

She asked him or she persuaded him to leave

He knew the girl left

Tycker du om Line? (Swedish: “Do you like Line?”)
Vad tycker du om Line? (Swedish: “What do you think of Line?”)


Lexical Ambiguity

I made her duck

her - possessive pronoun; her - object pronoun
duck - verb; duck - noun
make = create; make = cook
...


Structural Ambiguity

I saw a man in the park with a telescope

I saw a man in [the park with a telescope]
I saw [a man] in the park [with a telescope]
I [saw] a man in the park [with a telescope]
...
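
How many readings are there? A small Python sketch can count them. The grammar and lexicon below are toy assumptions written for this one sentence, and the function name `count_parses` is hypothetical. The counter caches the result for every (category, span) pair, which is one simple way of saving already obtained partial parses instead of recomputing them.

```python
from functools import lru_cache

# Toy binary grammar (assumption): parent -> list of (left, right) children.
BINARY = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("VP", "PP")],
    "NP": [("Det", "N"), ("NP", "PP")],
    "PP": [("P", "NP")],
}
# Toy lexicon (assumption): word -> possible categories.
LEXICON = {
    "i": {"NP"}, "saw": {"V"}, "a": {"Det"}, "the": {"Det"},
    "man": {"N"}, "park": {"N"}, "telescope": {"N"},
    "in": {"P"}, "with": {"P"},
}

def count_parses(sentence: str, root: str = "S") -> int:
    """Count the distinct parse trees of the sentence under the toy grammar."""
    words = tuple(sentence.lower().split())

    @lru_cache(maxsize=None)  # each partial result is computed once and reused
    def count(cat: str, i: int, j: int) -> int:
        if j - i == 1:  # single word: look it up in the lexicon
            return int(cat in LEXICON.get(words[i], set()))
        total = 0
        for left, right in BINARY.get(cat, []):
            for k in range(i + 1, j):  # try every split point of the span
                total += count(left, i, k) * count(right, k, j)
        return total

    return count(root, 0, len(words))

print(count_parses("I saw a man in the park with a telescope"))
```

Under these assumptions the sentence gets five parses, and each additional prepositional phrase multiplies the possibilities further.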


Ambiguity

What’s the use?

How can people understand each other?

How does ambiguity affect our attempts to teach computers to understand language?


Redundancy

We discussed computers yesterday

Den gula bilen (Swedish: “the yellow car”)

(pseudo-)Amharic: Man-the he-died


Redundancy

Why?

Is redundancy a problem for NLP?


What is a word?

Beginning beginning end end. us US Bush bush


More words ...

New York George W. Bush President George W. Bush The former president George W. Bush vacuum cleaner 67676 123 6 "Gone with the wind" Paris-Dakar


Word Meaning

Built in from the start?! Or learnt by observation?

Word usage
– in context
– by a community

“the meaning of a word is its use in the language” (Ludwig Wittgenstein 1953)


Distributional Hypothesis

Words with similar usage have similar meanings
Similarity = shared contexts

(Zellig Harris 1954, 1968)

Distributional data used to model similarity

“you shall know a word by the company it keeps” (John Rupert Firth 1957)
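
A minimal sketch of the idea in Python, under assumptions of my own: a four-sentence toy corpus, a context window of two words on each side, and cosine similarity as the comparison measure. None of these choices come from the slides; they are just one simple way to turn "shared contexts" into a number.

```python
from collections import Counter
from math import sqrt

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy corpus (assumption), far too small for real use.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
    "the car needs fuel",
]
vecs = context_vectors(corpus)
print(cosine(vecs["cat"], vecs["dog"]))   # share several contexts
print(cosine(vecs["cat"], vecs["fuel"]))  # share none
```

With real data the same scheme needs a much larger corpus and usually some weighting of the raw counts, but the principle is unchanged: words are compared by the company they keep.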


Language Models (LM)

Statistical models of word sequences

Estimate the probability of a word sequence (given some observed context)

Probability distribution depends on data
– Size
– Type


Language Models, properties

Based on occurrences of units in context of use

Do not rely on pre-compiled knowledge
– Statistical calculation on data alone
– Little or no supervision

Flexible to change


Text corpus

A collection of texts may be used as an LM

Size Genres Domains

Tagged vs untagged


Modelling Change

New data will provide new occurrences for the LM

What data are relevant to the domain?

Human language changes fluidly
– semantic drift

Lexicalization:
– usage gets accepted by most communities
– (can give rise to synonyms)


Counting words

fax faxes

fax faxes

fax faxes

fax faxes

faxes

faxes


Counting: Wordforms

Should “cats” and “cat” count as the same when we’re counting? How about “geese” and “goose”?

Some terminology:
Lemma: a set of lexical forms having the same stem, major part of speech, and rough word sense
Wordform: fully inflected surface form
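
The distinction can be made concrete with a short Python sketch. The lemma table is hand-written for this illustration (an assumption, not a real lemmatizer), and the input is the "fax"/"faxes" sequence from the counting slide.

```python
from collections import Counter

# Hand-written lemma table (assumption, for illustration only).
LEMMAS = {"faxes": "fax", "cats": "cat", "geese": "goose"}

def count_units(text):
    """Return (number of tokens, wordform counts, lemma counts)."""
    tokens = text.lower().split()
    wordforms = Counter(tokens)
    lemmas = Counter(LEMMAS.get(t, t) for t in tokens)
    return len(tokens), wordforms, lemmas

n_tokens, wordforms, lemmas = count_units("fax faxes " * 4 + "faxes faxes")
print(n_tokens)        # running words (tokens)
print(dict(wordforms)) # distinct wordforms (types) with their counts
print(dict(lemmas))    # counts after mapping wordforms to lemmas
```

The same ten tokens give two wordform types but only one lemma, so the answer to "how many words?" depends on which unit is being counted.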


Counting: Corpora

• Brown et al (1992): large corpus of English text
  – 583 million wordform tokens
  – 293,181 wordform types

• Google Crawl of 1,024,908,267,229 English tokens
  – 13,588,391 wordform types

That seems like a lot of types... (even large dictionaries of English have only about 500k types). Why so many here?

• Numbers
• Misspellings
• Names
• Acronyms
• etc


Predicting the next word

This is a rather long and boring sentence that ends with a totally unknown ???

What kind of data is needed?


Basic Problems

Data Sparseness
– sequences don't appear, or
– appear with insignificant frequency

Unknown words (OOV = out-of-vocabulary)
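
One common way to cope with both problems is sketched below in Python. The slides do not name a remedy, so the techniques here are my own choice for illustration: rare words are folded into a placeholder token `<UNK>`, and add-one (Laplace) smoothing keeps every probability above zero. The function names and the `min_count` threshold are assumptions.

```python
from collections import Counter

def train_unigram(tokens, min_count=2):
    """Count words, folding those seen fewer than min_count times into <UNK>."""
    raw = Counter(tokens)
    counts = Counter()
    for word, c in raw.items():
        counts[word if c >= min_count else "<UNK>"] += c
    counts.setdefault("<UNK>", 0)  # make sure the placeholder always exists
    return counts

def prob(word, counts):
    """Add-one smoothed unigram probability; unseen words are treated as <UNK>."""
    key = word if word in counts else "<UNK>"
    total = sum(counts.values())
    return (counts[key] + 1) / (total + len(counts))

counts = train_unigram("the dog sleeps the dog barks".split())
print(prob("the", counts))
print(prob("elephant", counts))  # never seen, but still gets some probability
```

The price is that all rare and unseen words become indistinguishable, which is a crude answer to sparseness but enough to keep a model from assigning probability zero to new input.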


N-grams

Unigram: word

Bigram: word1 word2

Trigram: word1 word2 word3

N-gram: word1 word2 ... wordn
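
Extracting n-grams from a tokenized sentence takes only a sliding window. A minimal Python sketch follows; the function name `ngrams` and the whitespace tokenization are assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "John doesn't see the cars".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A sentence of k tokens yields k - n + 1 n-grams, and none at all when n is larger than k.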


Predicting the next word ...

14-grams:

4  This is a rather long and boring sentence that ends with a totally unknown word

1  This is a rather long and boring sentence that ends with a totally unknown elephant


Predicting words ...

How much time is needed to collect texts to get these 14-gram frequencies?

How big would the resulting corpus be?


Independence assumption

Maybe we don’t need the 14-grams ...

Markov assumption: The item we are trying to predict is dependent only on the X preceding items

1st order Markov model: $P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$

Andrei A. Markov (1856-1922)
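
A first-order Markov model over words is a bigram model. The Python sketch below estimates the conditional probabilities by relative frequency from a three-sentence toy corpus; the corpus, the function names, and the `<s>` / `</s>` boundary markers are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count, for each word, which words follow it."""
    follow = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            follow[prev][cur] += 1
    return follow

def cond_prob(cur, prev, follow):
    """Relative-frequency estimate of P(cur | prev)."""
    counts = follow.get(prev, Counter())
    total = sum(counts.values())
    return counts[cur] / total if total else 0.0

def sentence_prob(sentence, follow):
    """Probability of a sentence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= cond_prob(cur, prev, follow)
    return p

def predict_next(prev, follow):
    """Most frequent word observed after prev, or None if prev is unseen."""
    counts = follow.get(prev, Counter())
    return counts.most_common(1)[0][0] if counts else None

model = train_bigram(["the dog sleeps", "the dog barks", "the cat sleeps"])
print(predict_next("the", model))
print(sentence_prob("the dog sleeps", model))
print(sentence_prob("the cat barks", model))  # unseen bigram "cat barks"
```

The last line shows the cost of the unsmoothed estimate: a perfectly reasonable sentence gets probability zero because one of its bigrams never occurred in the training data.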


PRESEMT: Pattern REcognition-based Statistically Enhanced MT

• EU project (6 partners in 5 countries)
• Applying Machine Learning to Machine Translation
  – Genetic Algorithms
  – Self-Organising Maps
  – Ant Path Optimisation
  – Particle Swarm Optimisation
• Word Translation Disambiguation
• Large monolingual corpora
• Small bilingual corpora