Natural Language Processing and Communication
Björn Gambäck
Department of Computer and Information Science
Norwegian University of Science and Technology
Trondheim, Norway
and
SICS, Swedish Institute of Computer Science AB
Stockholm, Sweden
Two Main Reasons to Process Natural Languages
Allow computer agents to communicate with people (Ch 23)
Allow agents to acquire information from (written) language (Ch 22)
Languages are for Communication
A speaker must put words to his/her thoughts
A hearer must recognize the thoughts expressed from the words he/she perceives
Both presuppose the capacity to recognize systematic connections between meaning and linguistic form
Two Fundamental Traits of Natural Language Processing
Make it easier for people to communicate with computers
Make it easier for people to communicate with people
Two Fundamental Traits of Human Languages
Ambiguity: A word or a string of words has more than one meaning
Redundancy: The same information is expressed more than once
Forget About It!
FBI Technician: What’s “forget about it”?
Donnie Brasco: “Forget about it” is like if you agree with someone, you know, like “Raquel Welch is one great piece of ass, forget about it.”
But then, if you disagree, like “A Lincoln is better than a Cadillac? Forget about it!” you know?
But then, it’s also like if something’s the greatest thing in the world, like Mingio’s Peppers, “forget about it.”
But it’s also like saying “Go to hell!” too. Like, you know, like “Hey Paulie, you got a one inch pecker?” and Paulie says “Forget about it!”
Sometimes it just means forget about it.
Björn Gambäck? Who?!
Swedish and US high schools (Nacka; Champaign, Illinois)
MSc (civ. ing.) Computer Science & Engineering, KTH, Stockholm
Linguistics, Computational Linguistics, Stockholm U
PhD (tekn. dr.) Computer & System Sciences, KTH, Stockholm
SICS, Swedish Institute of Computer Science AB, Stockholm, 1989-
U Saarbrücken, Germany 1995-96
Helsinki U, Finland 1997-99
KTH, Stockholm, Sweden 1997-99
Addis Ababa U, Ethiopia 2004
NTNU, 2008- (Prof. Language Technology)
proud father
decent chess player
lousy football referee...
Pensum / Literature
• These slides
Further Reading:
• Russell & Norvig 2010:
– “Natural Language Processing” (Ch 22)
– “Natural Language for Communication” (Ch 23)
• Gambäck 1999:
– “Human Language Technology: The Babel Fish”
www.idi.ntnu.no/~gamback/teaching/TDT4171/gamback_1999.pdf
‘The Babel Fish,’ said The Hitch Hiker’s Guide to the Galaxy quietly, ‘is small, yellow and leech-like, and probably the oddest thing in the Universe. It feeds on brainwave energy received not from its own carrier but from those around it. It absorbs all unconscious and mental frequencies from this brainwave energy to nourish itself with. It then excretes into the mind of the carrier a telepathic matrix formed by combining the conscious thought frequencies with nerve signals picked up from the speech centres of the brain which has supplied them. The practical upshot of all this is that if you stick a Babel fish in your ear you can instantly understand anything said to you in any form of language. The speech patterns you actually hear decode the brainwave matrix which has been fed into your mind by the Babel fish.’
(Douglas Adams: “The Hitch Hiker’s Guide to the Galaxy”)
The Parts of the Babel Fish
1. Recognize the spoken utterances: distinguish the words in the sound waves.
2. Extract the meaning of the sentences: identify the speaker’s intentions.
3. Translate the intended meaning into an utterance in another language.
4. Speak it out.
Speech Recognition
Microphone
Large databases
Pattern matching
Transformations
Search
Language models
Speech Synthesis
Streams of connected words
Coarticulation
Prosody (intonation)
Large databases
Canned utterances
Semantics
Syntax: how signs are related to each other
Semantics: how signs are related to things
Pragmatics: how signs are related to people
Mr. Smith is expressive
Analysis Width
morphemes: car-s
words: cars
phrases: see the cars
sentences: John doesn’t see the cars.
paragraphs: Three sports cars are speeding down the street. John doesn’t see the cars. He steps out into the street…
texts: Our story is about a short-sighted man named John. He lives in a small city with narrow streets. One day John goes for a walk. Three sports cars are speeding down the street. John doesn’t see the cars. He steps out into the street…
The Research Frontier
Syntax
Compositional Semantics
Situational Semantics
Pragmatics
morphemes
words
phrases
sentences
paragraphs
texts
NLP Applications
What makes an application a language processing application (as opposed to any other piece of software)?
An application that requires the use of knowledge about human languages
Is Unix wc (word count) an example of a language processing application?
Applications: Word Count?
When it counts words: Yes
To count words you need to know what a word is. That’s knowledge of language.
When it counts lines and bytes: No
Lines and bytes are computer artifacts, not linguistic entities
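A minimal sketch of why counting words needs some knowledge of language: splitting on whitespace (roughly what `wc -w` does) and a simple regular-expression tokenizer disagree on what counts as a word. The regex, the function names and the sample sentence are illustrative assumptions, not part of the slides.

```python
import re

def whitespace_words(text):
    # Roughly what `wc -w` does: a word is a maximal run of non-whitespace.
    return text.split()

def regex_words(text):
    # Hypothetical tokenizer: runs of letters/digits, optionally joined by an
    # internal apostrophe or hyphen. Free-standing punctuation is not a word.
    return re.findall(r"\w+(?:['’-]\w+)*", text)

sentence = "John doesn't see the cars - he steps out."
print(whitespace_words(sentence))
print(regex_words(sentence))
```

The two functions return different counts for the same sentence, because the first treats the free-standing dash as a word and keeps the final full stop attached to "out".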
Some NLP applications
(What kind of knowledge of language is needed?)
Text-to-speech
Speech Recognition
OCR
Information Retrieval
Information Extraction
Machine Translation
Dialogue Systems
...
What is a language?
There are 6000-8000 languages in the World. (Why are the figures not more specific than that?)
There are 82 languages in Ethiopia. (How can we be sure of that? - Why not 80 or 85?)
How many languages are there in Norway?!
11! (according to the Ethnologue):
Norwegian: Bokmål, Nynorsk; Norwegian Sign Language
Finnish: Kven
Romani: Tavringer, Vlax; Norwegian Traveller
Saami: Lule, Pite, North, South
One or two individuals (languages)?
How can you tell if a person speaks the same language as yourself or if she speaks another, different language?
Do two speakers of the same language always speak alike? Is it always impossible to understand a person who speaks another language?
No, they don’t speak alike ...
several domains (farming, computer science, medicine, …)
dialects
sociolects
other sub-languages ...
No, sometimes you understand...
you know the language (from school / social contacts / your grandfather / …)
the languages are more or less similar:
long contact
close relationship
strong influence
Intelligibility
It may be hard to understand a speaker of the same language. It may be easy to understand a speaker of a different language.
Mutual intelligibility and other linguistic factors are never enough if we are trying to establish whether a linguistic variety is a language or not.
Conclusion: An army and a navy ...
Political and cultural factors often more important
“A language is a dialect with an army and a navy”
http://www.youtube.com/watch?v=12rNbGf2Wwo
Language Processing Systems
IBM's Watson system plays Jeopardy!
Analysis of Natural Languages
Syntax: actual structure of an utterance
Parsing: best possible way to make an analysis of an utterance
Semantics: representation of the meaning of an utterance (e.g., in a logical form)
Parsing Natural Languages
Highly ambiguous (in contrast to artificial languages)
Analysis problem more complex
The solutions are often based on different ways of saving already obtained partial parses
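One common way of saving partial parses is a chart, as in the CKY algorithm. The sketch below is an illustration under stated assumptions: the slides do not name a particular algorithm, and the toy grammar and lexicon are invented for the example. It counts how many parse trees a small grammar in Chomsky normal form assigns to a sentence, reusing the stored count for every sub-span instead of re-deriving it.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form: (parent, left, right).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]
# Hypothetical lexicon: word -> possible categories.
LEXICON = {
    "I": ["NP"], "saw": ["V"], "a": ["Det"], "the": ["Det"],
    "man": ["N"], "park": ["N"], "telescope": ["N"],
    "in": ["P"], "with": ["P"],
}

def count_parses(words, start="S"):
    n = len(words)
    # chart[(i, j)][X] = number of distinct X-trees covering words[i:j].
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for category in LEXICON[word]:
            chart[(i, i + 1)][category] += 1
    # Build longer spans from shorter ones that are already in the chart.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    left_count = chart[(i, k)].get(left, 0)
                    right_count = chart[(k, j)].get(right, 0)
                    if left_count and right_count:
                        chart[(i, j)][parent] += left_count * right_count
    return chart[(0, n)].get(start, 0)

sentence = "I saw a man in the park with a telescope".split()
print(count_parses(sentence))
```

With this grammar the example sentence receives more than one analysis, because each prepositional phrase can attach either to a noun phrase or to the verb phrase; the chart keeps the cost polynomial in sentence length even as the number of trees grows.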
Grammatical vs. Meaningful Sentences
Belonging to the string set: * brown sleeps blue dog the
Grammatical (belonging to the language)? The blue brown blue brown blue dog sleeps
Understandable: The blue dog sleeps
Meaningful: The brown dog sleeps
Grammar Coverage
Coverage is never complete
Add more rules…
“All grammars leak”
More specific rules
Add more features
Syntactic Ambiguity
Joe said that Martha expected that it would rain yesterday
She asked him or she persuaded him to leave
He knew the girl left
Tycker du om Line? Vad tycker du om Line? (Swedish: “Do you like Line?” / “What do you think of Line?”)
Lexical Ambiguity
I made her duck
her - possessive pronoun; her - object pronoun
duck - verb; duck - noun
make = create; make = cook ...
Structural Ambiguity
I saw a man in the park with a telescope
I saw a man in [the park with a telescope]
I saw [a man] in the park [with a telescope]
I [saw] a man in the park [with a telescope]
...
Ambiguity
What’s the use?
How can people understand each other?
How does ambiguity affect our attempts to teach computers to understand language?
Redundancy
We discussed computers yesterday
Den gula bilen (Swedish: “the yellow car”; definiteness is marked three times: den, -a, -en)
(pseudo-)Amharic: Man-the he-died
More words ...
New York
George W. Bush
President George W. Bush
The former president George W. Bush
vacuum cleaner
67676 123 6
"Gone with the wind"
Paris-Dakar
Word Meaning
Built in from the start?! Or learnt by observation?
Word usage
– in context
– by a community
“the meaning of a word is its use in the language” (Ludwig Wittgenstein 1953)
Distributional Hypothesis
Words with similar usage have similar meanings
Similarity = shared contexts
(Zellig Harris 1954, 1968)
Distributional data used to model similarity
“you shall know a word by the company it keeps” (John Rupert Firth 1957)
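A small sketch of the distributional idea, under stated assumptions: each word is represented by counts of the words that occur within a fixed window around it, and two words are compared with cosine similarity. The window size, the toy corpus and the function names are illustrative choices, not taken from the slides.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    # vectors[w][c] = how often c occurs within `window` positions of w.
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    # Cosine of the angle between two sparse count vectors.
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

corpus = [
    "the dog sleeps on the mat",
    "the cat sleeps on the mat",
    "the dog eats the food",
    "the cat eats the food",
    "john drives the car down the street",
]
vectors = context_vectors(corpus)
print(cosine(vectors["dog"], vectors["cat"]))
print(cosine(vectors["dog"], vectors["car"]))
```

In this toy corpus "dog" and "cat" occur in exactly the same contexts, so they score higher than "dog" and "car", which share only the context word "the".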
Language Models (LM)
Statistical models of word sequences
Estimate the probability of a word sequence (given some observed context)
Probability distribution depends on data
– Size
– Type
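As an illustration, a language model can be estimated from nothing more than counts. The sketch below assumes a bigram model with unsmoothed relative-frequency estimates; the boundary markers `<s>` and `</s>`, the function names and the three-sentence corpus are assumptions made for the example.

```python
from collections import Counter

def train_bigram(sentences):
    # Count each word as a history, and each adjacent word pair.
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return histories, bigrams

def sequence_probability(sentence, histories, bigrams):
    # Product of relative-frequency estimates count(prev, word) / count(prev).
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if histories[prev] == 0:
            return 0.0  # history never observed in the training data
        probability *= bigrams[(prev, word)] / histories[prev]
    return probability

corpus = ["the blue dog sleeps", "the brown dog sleeps", "the brown dog barks"]
histories, bigrams = train_bigram(corpus)
print(sequence_probability("the brown dog sleeps", histories, bigrams))
print(sequence_probability("the blue dog barks", histories, bigrams))
print(sequence_probability("dog the sleeps", histories, bigrams))
```

The second sentence never occurs in the corpus but still gets a non-zero probability, because all of its bigrams do occur; the third gets zero. Changing the size or type of the training data changes every estimate.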
Language Models, properties
Based on occurrences of units in context of use
Do not rely on pre-compiled knowledge
– Statistical calculation on data alone
– Little or no supervision
Flexible to change
Text corpus
A collection of texts may be used as an LM
Size
Genres
Domains
Tagged vs untagged
Modelling Change
New data will provide new occurrences for the LM
What data are relevant to the domain?
Human language changes fluidly
– semantic drift
Lexicalization:
– usage gets accepted by most communities
– (can give rise to synonyms)
03/25/11
Counting: Wordforms
Should “cats” and “cat” count as the same when we’re counting? How about “geese” and “goose”?
Some terminology:
Lemma: a set of lexical forms having the same stem, major part of speech, and rough word sense
Wordform: fully inflected surface form
Counting: Corpora
• Brown et al (1992) large corpus of English text
583 million wordform tokens
293,181 wordform types
• Google Crawl of 1,024,908,267,229 English tokens
13,588,391 wordform types
That seems like a lot of types... (even large dictionaries of English have only about 500k types). Why so many here?
• Numbers
• Misspellings
• Names
• Acronyms
• etc
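The token/type distinction can be sketched in a few lines. This is a minimal illustration assuming whitespace tokenization; the function name, the `lowercase` switch and the sample text are invented for the example.

```python
from collections import Counter

def token_type_counts(text, lowercase=False):
    # Tokens are running words; types are distinct wordforms.
    if lowercase:
        text = text.lower()
    tokens = text.split()
    types = Counter(tokens)
    return len(tokens), len(types)

text = "the cat saw the cats and the Cat saw 2 geese"
print(token_type_counts(text))
print(token_type_counts(text, lowercase=True))
```

Here "cat", "cats" and "Cat" are three wordform types unless case is folded, and the number "2" is a type of its own, which hints at why type counts grow so fast on web-scale text. Mapping "cats" to the lemma "cat", or "geese" to "goose", would need morphological knowledge that this sketch does not have.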
Predicting the next word
This is a rather long and boring sentence that ends with a totally unknown ???
What kind of data is needed?
Basic Problems
Data Sparseness
– sequences don't appear, or
– appear with insignificant frequency
Unknown words (OOV = out-of-vocabulary)
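Both problems can be measured directly on held-out text. The sketch below is a rough illustration with hypothetical function names and toy data: it computes the share of test tokens that are out-of-vocabulary and the share of test bigrams never seen in training.

```python
def oov_rate(train_tokens, test_tokens):
    # Fraction of test tokens absent from the training vocabulary.
    # Assumes test_tokens is non-empty.
    vocabulary = set(train_tokens)
    unknown = [t for t in test_tokens if t not in vocabulary]
    return len(unknown) / len(test_tokens)

def unseen_bigram_rate(train_tokens, test_tokens):
    # Fraction of test bigrams that never occur in the training data.
    # Assumes test_tokens has at least two tokens.
    seen = set(zip(train_tokens, train_tokens[1:]))
    test_bigrams = list(zip(test_tokens, test_tokens[1:]))
    unseen = [b for b in test_bigrams if b not in seen]
    return len(unseen) / len(test_bigrams)

train = "the blue dog sleeps and the brown dog sleeps".split()
test = "the brown elephant sleeps".split()
print(oov_rate(train, test))
print(unseen_bigram_rate(train, test))
```

A single unknown word ("elephant") makes every bigram that contains it unseen as well, so sparseness gets worse as the sequences get longer.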
N-grams
Unigram: word
Bigram: word1 word2
Trigram: word1 word2 word3
N-gram: word1 word2 ... wordn
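Extracting n-grams from a token list is a sliding window. A minimal sketch, with a hypothetical helper name and whitespace tokenization assumed:

```python
def ngrams(tokens, n):
    # All contiguous runs of n tokens, in order of occurrence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This is a rather long and boring sentence".split()
print(ngrams(tokens, 1)[:3])
print(ngrams(tokens, 2)[:3])
print(ngrams(tokens, 3)[:3])
```

A sequence of k tokens yields k - n + 1 n-grams, and none at all when n is larger than k.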
Predicting the next word ...
14-grams:
4 This is a rather long and boring sentence that ends with a totally unknown word
1 This is a rather long and boring sentence that ends with a totally unknown elephant
Predicting words ...
How much time is needed to collect enough text to get these 14-gram frequencies?
How big would the resulting corpus be?
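A back-of-the-envelope calculation gives a feel for the scale. It reuses the figures quoted earlier in the slides (293,181 wordform types in the Brown et al. corpus; 1,024,908,267,229 tokens in the Google crawl) and makes the simplifying assumption that any word can follow any other; the variable names are illustrative.

```python
vocabulary_size = 293_181          # wordform types (Brown et al. figure)
n = 14                             # n-gram length used in the example

# Upper bound on distinct 14-grams if every word combination were possible.
possible_ngrams = vocabulary_size ** n

# A corpus of N tokens contains at most N - n + 1 n-gram occurrences.
corpus_tokens = 1_024_908_267_229  # Google crawl figure
observable_ngrams = corpus_tokens - n + 1

print(f"possible 14-grams:   {possible_ngrams:.2e}")
print(f"observable 14-grams: {observable_ngrams:.2e}")
```

Even a trillion-token corpus can contain only a vanishing fraction of the conceivable 14-grams, which is why nearly all of them are never observed.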
Independence assumption
Maybe we don’t need the 14-grams ...
Markov assumption: The item we are trying to predict is dependent only on the X preceding items
1st order Markov model: P(q_i | q_1 ... q_{i-1}) = P(q_i | q_{i-1})
Andrei A. Markov (1856-1922)
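Written out for a whole sequence, the first-order assumption turns the chain rule into a product of bigram probabilities. The derivation below uses the slide's symbols q_i and assumes, as a convention, that the i = 1 factor is conditioned on a start-of-sequence marker:

```latex
P(q_1, \dots, q_n)
  = \prod_{i=1}^{n} P(q_i \mid q_1 \dots q_{i-1})
  \approx \prod_{i=1}^{n} P(q_i \mid q_{i-1})
```

The first equality is exact (chain rule); the approximation is where the Markov assumption enters. With X preceding items instead of one, each factor becomes P(q_i | q_{i-X} ... q_{i-1}).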
PRESEMT: Pattern REcognition-based Statistically Enhanced MT
• EU project (6 partners in 5 countries)
• Applying Machine Learning to Machine Translation
– Genetic Algorithms
– Self-Organising Maps
– Ant Path Optimisation
– Particle Swarm Optimisation
• Word Translation Disambiguation
• Large monolingual corpora
• Small bilingual corpora