Intronlp
Transcript of Intronlp
-
8/12/2019 Intronlp
1/26
Natural Language ProcessingRada Mihalcea
Fall 2011
-
8/12/2019 Intronlp
2/26
Any Light at The End of The Tunnel?
Yahoo, Google, Microsoft Information Retrieval Monster.com, HotJobs.com (Job finders) Information Extraction +
Information Retrieval
Systran powers Babelfish Machine Translation
Ask Jeeves Question Answering
Myspace, Facebook, Blogspot Processing of User-Generated Content
Tools for business intelligence
All Big Guys have (several) strong NLP research labs: IBM, Microsoft, AT&T, Xerox, Sun, etc.
Academia: research in an university environment
-
8/12/2019 Intronlp
3/26
Why Natural Language Processing ?
Huge amounts of data Internet = at least 20
billions pages
Intranet
Applications for
processing large
amounts of texts
requireNLP expertise
Classify text into categories Index and search large texts
Automatic translation
Speech understanding
Understand phone conversations
Information extraction
Extract useful information from resumes
Automatic summarization
Condense 1 book into 1 page
Question answering Knowledge acquisition
Text generations / dialogues
-
8/12/2019 Intronlp
4/26
Natural?
Natural Language? Refers to the language spoken by people, e.g. English,
Japanese, Swahili, as opposed to artificial languages, like
C++, Java, etc.
Natural Language Processing Applications that deal with natural language in a way or
another
[Computational Linguistics
Doing linguistics on computers
More on the linguistic side than NLP, but closely related ]
-
8/12/2019 Intronlp
5/26
Why Natural Language Processing?
kJfmmfj mmmvvv nnnffn333 Uj iheale eleee mnster vensi credur
Baboi oi cestnitze
Coovoel2^ ekk; ldsllk lkdf vnnjfj?
Fgmflmllk mlfm kfre xnnn!
-
8/12/2019 Intronlp
6/26
Computers Lack Knowledge!
Computers see text in English the same you haveseen the previous text!
People have no trouble understanding language
Common sense knowledge
Reasoning capacity
Experience
Computers have
No common sense knowledgeNo reasoning capacity
-
8/12/2019 Intronlp
7/26
Where does it fit in the CS taxonomy?
Computers
Artificial Intelligence AlgorithmsDatabases Networking
Robotics SearchNatural Language Processing
Information
Retrieval
Machine
Translation
Language
Analysis
Semantics Parsing
-
8/12/2019 Intronlp
8/26
Linguistics Levels of Analysis
Speech Written language
Phonology: sounds / letters / pronunciation
Morphology: the structure of wordsSyntax: how these sequences are structured
Semantics: meaning of the strings
Interaction between levels
-
8/12/2019 Intronlp
9/26
Issues in Syntax
the dog ate my homework - Who did what?1. Identify the part of speech (POS)
Dog = noun ; ate = verb ; homework = noun
English POS tagging: 95%
2. Identify collocations
mother in law, hot dog
Compositional versus non-compositional
collocates
-
8/12/2019 Intronlp
10/26
Issues in Syntax
Shallow parsing:the dog chased the bear
the dog chased the bear
subject - predicateIdentify basic structures
NP-[the dog] VP-[chased the bear]
-
8/12/2019 Intronlp
11/26
Issues in Syntax
Full parsing: John loves Mary
Help figuring out (automatically) questions like: Who did what
and when?
-
8/12/2019 Intronlp
12/26
More Issues in Syntax
Anaphora Resolution:The dog entered my room. Itscared me
Preposition Attachment
I saw the man in the park with a telescope
-
8/12/2019 Intronlp
13/26
Issues in Semantics
Understand language! How? plant = industrial plant
plant = living organism
Words are ambiguous
Importance of semantics?
Machine Translation: wrong translations
Information Retrieval: wrong information
Anaphora Resolution: wrong referents
-
8/12/2019 Intronlp
14/26
The sea is at the home for billions factories andanimals
The sea is home to million of plants andanimals
English French [commercial MT system]
Le mer est a la maison de billion des usines etdes animaux
French English
Why Semantics?
-
8/12/2019 Intronlp
15/26
Issues in Semantics
How to learn the meaning of words? From dictionaries:
plant, works, industrial plant -- (buildings for carrying on
industrial labor; "they built a large plant to manufactureautomobiles")
plant, flora, plant life -- (a living organism lacking the power of
locomotion)
They are producing about 1,000 automobiles in the new plantThe sea flora consists in 1,000 different plant species
The plant was close to the farm of animals.
-
8/12/2019 Intronlp
16/26
Issues in Semantics
Learn from annotated examples: Assume 100 examples containing plant
previously tagged by a human
Train a learning algorithmHow to choose the learning algorithm?
How to obtain the 100 tagged examples?
-
8/12/2019 Intronlp
17/26
Issues in Information Extraction
There was a group of about 8-9 people close tothe entrance on Highway 75
Who? 8-9 people
Where? highway 75
Extract information
Detect new patterns:Detect hacking / hidden information / etc.
Gov./mil. puts lots of money put into IEresearch
-
8/12/2019 Intronlp
18/26
Issues in Information Retrieval
General model: A huge collection of texts
A query
Task: find documents that are relevant to the given
query
How? Create an index, like the index in a book
More
Vector-space models
Boolean models
Examples: Google, Yahoo, Altavista, etc.
-
8/12/2019 Intronlp
19/26
Issues in Information Retrieval
Retrieve specific information Question Answering
What is the height of mount Everest?
11,000 feet
-
8/12/2019 Intronlp
20/26
Issues in Information Retrieval
Find information across languages! Cross Language Information Retrieval
What is the minimum age requirement for car
rental in Italy? Search also Italian texts for eta minima per
noleggio macchine
Integrate large number of languages
Integrate into performant IR engines
-
8/12/2019 Intronlp
21/26
Issues in Machine Translations
Text to Text Machine Translations Speech to Speech Machine Translations
Most of the work has addressed pairs of widelyspread languages like English-French, English-
Chinese
-
8/12/2019 Intronlp
22/26
Issues in Machine Translations
How to translate text?Learn from previously translated data
Need parallel corpora
French-English, Chinese-English have theHansards
Reasonable translations
Chinese-Hindino such tools available today!
-
8/12/2019 Intronlp
23/26
Even More
Discourse Summarization
Subjectivity and sentiment analysis
Text generation, dialog [pass the Turing test forsome million dollars]Loebner prize
Knowledge acquisition [how to get that
common sense knowledge]
Speech processing
-
8/12/2019 Intronlp
24/26
What will we study this semester?
Intro to Perl Great great for text processing Fast: one person can do the work of ten others
Easy to pick up
Some linguistic basics
Structure of English Parts of speech, phrases, parsing
Morphology
N-grams
Also multi-word expressions Part of speech tagging
Syntactic parsing
Semantics
Word sense disambiguation
Semantic relations
-
8/12/2019 Intronlp
25/26
What will we study this semester?
Information Retrieval Question answering
Text classification
Text summarization
Sentiment analysis
Depending on time, we may touch on
Speech recognition Dialogue
Text generation
Other topics of your interest
-
8/12/2019 Intronlp
26/26
Administrivia
Instructor: Rada Mihalcea, F228, [email protected] Class meetings: TTh 11-12:20pm
Office hours: TTh 4:00-5:00pm
TA: TBA
Textbook: Speech and Language Processing, by Jurafskyand Martin (2ndedition)
Recommended: Statistical Methods in NLP, by Manning andSchutze
Other readings (papers) may be assigned throughout thesemester
Grading: Assignments, 2 exams, term project Late submission policy for assignments: can submit up to three days late,
with 10% penalty / day