Intronlp


  • 8/12/2019 Intronlp

    1/26

    Natural Language Processing

    Rada Mihalcea

    Fall 2011

  • 8/12/2019 Intronlp

    2/26

    Any Light at The End of The Tunnel?

    Yahoo, Google, Microsoft: Information Retrieval

    Monster.com, HotJobs.com (job finders): Information Extraction + Information Retrieval

    Systran powers Babelfish: Machine Translation

    Ask Jeeves: Question Answering

    Myspace, Facebook, Blogspot: Processing of User-Generated Content

    Tools for business intelligence

    All Big Guys have (several) strong NLP research labs: IBM, Microsoft, AT&T, Xerox, Sun, etc.

    Academia: research in a university environment

  • 8/12/2019 Intronlp

    3/26

    Why Natural Language Processing?

    Huge amounts of data

    Internet = at least 20 billion pages

    Intranet

    Applications for processing large amounts of text require NLP expertise

    Classify text into categories

    Index and search large texts

    Automatic translation

    Speech understanding

    Understand phone conversations

    Information extraction

    Extract useful information from resumes

    Automatic summarization

    Condense 1 book into 1 page

    Question answering

    Knowledge acquisition

    Text generation / dialogues

  • 8/12/2019 Intronlp

    4/26

    Natural?

    Natural Language? Refers to the language spoken by people, e.g. English, Japanese, Swahili, as opposed to artificial languages, like C++, Java, etc.

    Natural Language Processing: Applications that deal with natural language in one way or another

    [Computational Linguistics: Doing linguistics on computers. More on the linguistic side than NLP, but closely related]

  • 8/12/2019 Intronlp

    5/26

    Why Natural Language Processing?

    kJfmmfj mmmvvv nnnffn333 Uj iheale eleee mnster vensi credur

    Baboi oi cestnitze

    Coovoel2^ ekk; ldsllk lkdf vnnjfj?

    Fgmflmllk mlfm kfre xnnn!

  • 8/12/2019 Intronlp

    6/26

    Computers Lack Knowledge!

    Computers see text in English the same way you have just seen the previous text!

    People have no trouble understanding language

    Common sense knowledge

    Reasoning capacity

    Experience

    Computers have

    No common sense knowledge

    No reasoning capacity

  • 8/12/2019 Intronlp

    7/26

    Where does it fit in the CS taxonomy?

    [Taxonomy diagram, one level per line]

    Computers

    Artificial Intelligence | Algorithms | Databases | Networking

    Robotics | Search | Natural Language Processing

    Information Retrieval | Machine Translation | Language Analysis

    Semantics | Parsing

  • 8/12/2019 Intronlp

    8/26

    Linguistics Levels of Analysis

    Speech Written language

    Phonology: sounds / letters / pronunciation

    Morphology: the structure of words

    Syntax: how these sequences are structured

    Semantics: meaning of the strings

    Interaction between levels

  • 8/12/2019 Intronlp

    9/26

    Issues in Syntax

    the dog ate my homework - Who did what?

    1. Identify the part of speech (POS)

    Dog = noun ; ate = verb ; homework = noun

    English POS tagging: 95%

    2. Identify collocations

    mother in law, hot dog

    Compositional versus non-compositional collocates

  • 8/12/2019 Intronlp

    10/26

    Issues in Syntax

    Shallow parsing: the dog chased the bear

    the dog chased the bear

    subject - predicate

    Identify basic structures

    NP-[the dog] VP-[chased the bear]
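
The subject/predicate split above can be sketched in a few lines of Python. This is only an illustration: the POS tags are assigned by hand (a real system would get them from a tagger), and the names `TAGGED`, `np_chunks`, `subject_predicate` and `bracket` are made up for this example.

```python
# Hand-assigned POS tags for the slide's sentence (an assumption: a real
# system would obtain these from a POS tagger).
TAGGED = [("the", "DET"), ("dog", "NOUN"), ("chased", "VERB"),
          ("the", "DET"), ("bear", "NOUN")]


def np_chunks(tagged):
    """Group each run of DET/ADJ/NOUN words into one noun-phrase chunk."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in {"DET", "ADJ", "NOUN"}:
            current.append(word)
        else:
            if current:
                chunks.append(("NP", current))
                current = []
            chunks.append((tag, [word]))
    if current:
        chunks.append(("NP", current))
    return chunks


def subject_predicate(tagged):
    """Split at the first verb: the words before it form the subject,
    the verb and everything after it form the predicate."""
    verb_at = next(i for i, (_, tag) in enumerate(tagged) if tag == "VERB")
    subject = [word for word, _ in tagged[:verb_at]]
    predicate = [word for word, _ in tagged[verb_at:]]
    return subject, predicate


def bracket(label, words):
    """Render a chunk in the slide's notation, e.g. NP-[the dog]."""
    return f"{label}-[{' '.join(words)}]"


subject, predicate = subject_predicate(TAGGED)
print(bracket("NP", subject), bracket("VP", predicate))
```

Splitting at the first verb only works for simple declarative sentences like this one; it is meant to show what "basic structures" look like, not how a real shallow parser is built.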

  • 8/12/2019 Intronlp

    11/26

    Issues in Syntax

    Full parsing: John loves Mary

    Helps answer (automatically) questions like: Who did what, and when?

  • 8/12/2019 Intronlp

    12/26

    More Issues in Syntax

    Anaphora Resolution: The dog entered my room. It scared me

    Preposition Attachment

    I saw the man in the park with a telescope

  • 8/12/2019 Intronlp

    13/26

    Issues in Semantics

    Understand language! How?

    plant = industrial plant

    plant = living organism

    Words are ambiguous

    Importance of semantics?

    Machine Translation: wrong translations

    Information Retrieval: wrong information

    Anaphora Resolution: wrong referents

  • 8/12/2019 Intronlp

    14/26

    Why Semantics?

    The sea is home to millions of plants and animals

    English -> French [commercial MT system]

    Le mer est a la maison de billion des usines et des animaux

    French -> English

    The sea is at the home for billions factories and animals

  • 8/12/2019 Intronlp

    15/26

    Issues in Semantics

    How to learn the meaning of words? From dictionaries:

    plant, works, industrial plant -- (buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles")

    plant, flora, plant life -- (a living organism lacking the power of locomotion)

    They are producing about 1,000 automobiles in the new plant

    The sea flora consists of 1,000 different plant species

    The plant was close to the farm of animals.
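
One way to use the dictionary entries above is to count the words a sentence shares with each definition and pick the sense with the largest overlap (a simplified form of the Lesk approach, named here as an assumption, since the slides do not name a method). The sketch below is hypothetical: the stopword list is minimal and hand-picked, and `SENSES`, `content_words` and `lesk` are illustrative names.

```python
import re

# Minimal hand-picked stopword list (an assumption for this example).
STOPWORDS = {"a", "the", "of", "for", "on", "to", "in",
             "they", "are", "was", "about"}

# The two dictionary entries from the slide, keyed by a short sense label.
SENSES = {
    "industrial plant": "plant, works, industrial plant -- buildings for "
                        "carrying on industrial labor; they built a large "
                        "plant to manufacture automobiles",
    "living organism": "plant, flora, plant life -- a living organism "
                       "lacking the power of locomotion",
}


def content_words(text, target="plant"):
    """Lowercase alphabetic tokens, minus stopwords and the target word."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS and w != target}


def lesk(sentence, senses=SENSES):
    """Return the sense whose entry shares the most words with the
    sentence, or None when no sense wins (no overlap, or a tie)."""
    context = content_words(sentence)
    scores = {sense: len(context & content_words(entry))
              for sense, entry in senses.items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


for sentence in ["They are producing about 1,000 automobiles in the new plant",
                 "The sea flora consists of 1,000 different plant species",
                 "The plant was close to the farm of animals."]:
    print(sentence, "->", lesk(sentence))
```

The first sentence shares "automobiles" with the industrial entry and the second shares "flora" with the living-organism entry. The third shares nothing with either, so the sketch returns None, which shows why dictionary overlap alone is not enough.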

  • 8/12/2019 Intronlp

    16/26

    Issues in Semantics

    Learn from annotated examples:

    Assume 100 examples containing plant previously tagged by a human

    Train a learning algorithm

    How to choose the learning algorithm?

    How to obtain the 100 tagged examples?
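
As a rough illustration of learning from tagged examples, the sketch below trains a Naive Bayes classifier with add-one smoothing. Naive Bayes is just one possible choice of learning algorithm, not one the slides prescribe. The six training sentences are invented for this example and stand in for the 100 human-tagged ones; `EXAMPLES`, `train` and `classify` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

# Invented stand-ins for the human-tagged examples (an assumption).
EXAMPLES = [
    ("the plant produces automobiles and trucks", "industrial"),
    ("workers at the plant assemble engines", "industrial"),
    ("the new plant will manufacture steel", "industrial"),
    ("the plant needs water and sunlight", "living"),
    ("a plant with green leaves grows in the garden", "living"),
    ("this plant species flowers in spring", "living"),
]


def train(examples):
    """Count words per sense and examples per sense."""
    word_counts = defaultdict(Counter)
    sense_counts = Counter()
    for sentence, sense in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(sentence.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, sense_counts, vocab


def classify(sentence, model):
    """Pick the sense with the highest log-probability under Naive Bayes
    with add-one (Laplace) smoothing."""
    word_counts, sense_counts, vocab = model
    total_examples = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense in sense_counts:
        score = math.log(sense_counts[sense] / total_examples)
        total_words = sum(word_counts[sense].values())
        for word in sentence.lower().split():
            score += math.log((word_counts[sense][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


model = train(EXAMPLES)
print(classify("the plant produces steel", model))
print(classify("green plant in the garden", model))
```

With so few examples the classifier only recognizes words it has already seen near each sense, which is why the question of how to obtain enough tagged examples matters.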

  • 8/12/2019 Intronlp

    17/26

    Issues in Information Extraction

    There was a group of about 8-9 people close to the entrance on Highway 75

    Who? 8-9 people

    Where? Highway 75

    Extract information

    Detect new patterns:

    Detect hacking / hidden information / etc.

    Gov./mil. puts lots of money into IE research

  • 8/12/2019 Intronlp

    18/26

    Issues in Information Retrieval

    General model:

    A huge collection of texts

    A query

    Task: find documents that are relevant to the given query

    How? Create an index, like the index in a book

    More

    Vector-space models

    Boolean models

    Examples: Google, Yahoo, Altavista, etc.
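
The index, the Boolean model and the vector-space model can each be sketched briefly. The code below is illustrative only: the three "documents" are sentences borrowed from earlier slides, term weights are raw counts (real engines typically use weighting schemes such as tf-idf), and `DOCS`, `build_index`, `boolean_and`, `cosine` and `rank` are made-up names.

```python
import math
from collections import Counter, defaultdict

# A toy collection (an assumption: three sentences standing in for documents).
DOCS = {
    1: "the dog chased the bear",
    2: "the dog ate my homework",
    3: "the bear entered my room",
}


def build_index(docs):
    """Inverted index: map each word to the set of documents containing it,
    like the index at the back of a book."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_and(query, index):
    """Boolean model: documents that contain every query word."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Vector-space model: documents ordered by similarity to the query,
    dropping those with no word in common."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(text.lower().split())), doc_id)
              for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True)
            if score > 0]


INDEX = build_index(DOCS)
print(boolean_and("dog bear", INDEX))
print(rank("dog homework", DOCS))
```

The Boolean query returns an unordered set of exact matches, while the vector-space query returns a ranked list that also includes partial matches.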

  • 8/12/2019 Intronlp

    19/26

    Issues in Information Retrieval

    Retrieve specific information: Question Answering

    What is the height of Mount Everest?

    About 29,000 feet

  • 8/12/2019 Intronlp

    20/26

    Issues in Information Retrieval

    Find information across languages! Cross Language Information Retrieval

    What is the minimum age requirement for car rental in Italy?

    Search also Italian texts for "età minima per noleggio macchine"

    Integrate a large number of languages

    Integrate into high-performance IR engines

  • 8/12/2019 Intronlp

    21/26

    Issues in Machine Translations

    Text to Text Machine Translations

    Speech to Speech Machine Translations

    Most of the work has addressed pairs of widespread languages like English-French, English-Chinese

  • 8/12/2019 Intronlp

    22/26

    Issues in Machine Translations

    How to translate text?

    Learn from previously translated data

    Need parallel corpora

    French-English, Chinese-English have the Hansards

    Reasonable translations

    Chinese-Hindi: no such tools available today!

  • 8/12/2019 Intronlp

    23/26

    Even More

    Discourse

    Summarization

    Subjectivity and sentiment analysis

    Text generation, dialog [pass the Turing test for some million dollars: Loebner prize]

    Knowledge acquisition [how to get that common sense knowledge]

    Speech processing

  • 8/12/2019 Intronlp

    24/26

    What will we study this semester?

    Intro to Perl

    Great for text processing

    Fast: one person can do the work of ten others

    Easy to pick up

    Some linguistic basics

    Structure of English

    Parts of speech, phrases, parsing

    Morphology

    N-grams

    Also multi-word expressions

    Part of speech tagging

    Syntactic parsing

    Semantics

    Word sense disambiguation

    Semantic relations

  • 8/12/2019 Intronlp

    25/26

    What will we study this semester?

    Information Retrieval

    Question answering

    Text classification

    Text summarization

    Sentiment analysis

    Depending on time, we may touch on

    Speech recognition

    Dialogue

    Text generation

    Other topics of your interest

  • 8/12/2019 Intronlp

    26/26

    Administrivia

    Instructor: Rada Mihalcea, F228, [email protected]

    Class meetings: TTh 11-12:20pm

    Office hours: TTh 4:00-5:00pm

    TA: TBA

    Textbook: Speech and Language Processing, by Jurafsky and Martin (2nd edition)

    Recommended: Foundations of Statistical Natural Language Processing, by Manning and Schütze

    Other readings (papers) may be assigned throughout the semester

    Grading: Assignments, 2 exams, term project

    Late submission policy for assignments: can submit up to three days late, with 10% penalty / day