NLP Crash Course

Posted on 11-Aug-2014

Description

Charlie Greenbacker, founder and co-organizer of the DC NLP meetup group, provides a "crash course" in Natural Language Processing techniques and applications.

Transcript of NLP Crash Course

NLP “Crash Course”

Charlie Greenbacker

dcnlp.org

Agenda

• Introduction & Motivation

• Famous Examples

• Basics

• Major Task Areas

• Protips

• Resources

Introduction & Motivation

By “NLP” we mean...

Natural Language Processing (#NLProc)

aka Computational Linguistics, Text Analytics, etc.

not Neuro-linguistic Programming! (#NLP)

Introduction & Motivation

Natural Language Processing is...

Using computers to process (i.e., analyze, understand, generate, etc.) natural human languages (e.g., English, Chinese, Klingon).

Hello, world! 你好,世界!

Introduction & Motivation

That sounds hard... why should I care?

• Most of the knowledge created by humans is unstructured text (information overload)

• Need some way to make sense of it all

• Enable quantitative analysis of text data

Famous Examples

Siri (Apple, SRI, Nuance): Speech Recognition/Generation

IBM Watson: Question Answering

Google Translate: Machine Translation

Basics

• Segmentation

• Part-of-speech tagging

• Noun phrase (NP) chunking

• Parsing

• Word sense disambiguation
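As a rough illustration of segmentation, the first item above, here is a minimal sketch using only the Python standard library. The regular expressions and the function names `split_sentences` and `tokenize` are assumptions made for this example, not part of the talk; real toolkits handle abbreviations ("Dr. Smith") and languages written without spaces (such as Chinese) far more carefully.

```python
import re

def split_sentences(text):
    """Naive sentence segmentation: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive word segmentation: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Hello, world! NLP is fun. Is it hard?"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Later steps such as part-of-speech tagging and parsing consume these tokens, so segmentation mistakes tend to propagate downstream.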

Basics

• Stop words, stemming/lemmatization

• Frequency analysis (terms, n-grams, TF-IDF)

• Machine learning (classification, clustering, recommendation)
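The stop word and frequency analysis items above can be sketched in a few lines of standard-library Python. The tiny `STOP_WORDS` set, the function names, and the particular TF-IDF variant (relative term frequency times the natural log of inverse document frequency) are assumptions chosen for illustration; libraries differ in smoothing and normalization.

```python
import math
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "is", "of", "on", "and"}

def terms(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def ngrams(tokens, n):
    """All contiguous n-token windows, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Return one {term: score} dict per document."""
    tokenized = [terms(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))  # count each term once per document
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        scores.append({t: (c / total) * math.log(n_docs / doc_freq[t])
                       for t, c in counts.items()})
    return scores
```

A term that appears in every document gets a score of zero under this variant, which is the intended effect: it carries no information for telling documents apart.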

Major Task Areas

Question Answering

• Match query with knowledge base

• Closed domain vs open domain

• Reasoning about intent of question

Major Task Areas

Speech Recognition

• Speech to text

• Trained/untrained user models

• Voice-based interfaces

Major Task Areas

Named Entity Recognition

• Entity extraction

• Persons, organizations, locations

• Grammar, syntax, phrasing

Major Task Areas

Entity Resolution

• Linking names to ground truth

• Disambiguating similar names
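One very simple way to link a name variant to a known entry is fuzzy string matching. The sketch below uses `difflib.get_close_matches` from the Python standard library; the `KNOWN_ENTITIES` list, the `resolve` function, and the 0.6 cutoff are hypothetical choices for this example. String similarity alone cannot tell two different people with the same name apart, which usually requires context.

```python
import difflib

# Hypothetical mini knowledge base standing in for the "ground truth".
KNOWN_ENTITIES = [
    "International Business Machines",
    "Apple Inc.",
    "Nuance Communications",
]

def resolve(mention, candidates=KNOWN_ENTITIES, cutoff=0.6):
    """Return the closest known entity for a mention, or None if nothing is close."""
    lowered = {c.lower(): c for c in candidates}
    matches = difflib.get_close_matches(
        mention.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None

print(resolve("Apple Inc"))
print(resolve("Zebra"))
```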

Major Task Areas

Co-reference Resolution

• Finding antecedents for pronouns

• Name resolution

Major Task Areas

Relationship Extraction

• Attribute values

• SVO triples

• Populating ontologies

Major Task Areas

Information Retrieval

• Query expansion

• Relevancy of results

• “More like this”
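A "more like this" feature is often built on vector similarity between documents. The following is a minimal sketch that assumes raw term counts and cosine similarity; the function names are made up for this example, and production search engines typically add weighting such as TF-IDF.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: term -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def more_like_this(query, docs):
    """Rank documents from most to least similar to the query."""
    q = vectorize(query)
    return sorted(docs, key=lambda d: cosine(q, vectorize(d)), reverse=True)
```

Query expansion would fit in before `vectorize(query)`, for example by adding synonyms of the query terms.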

Major Task Areas

Assistive Technologies

• Text simplification

• Predictive text input

• Alternative interfaces
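Predictive text input can be approximated with a bigram model that suggests the words most often seen after the previous word. This is only a sketch: the training corpus and the names `train_bigrams` and `predict_next` are assumptions for illustration, and real systems use far larger corpora and longer context.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for each word, which words follow it in the training sentences."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following

def predict_next(model, word, k=3):
    """Return up to k most frequent next words after `word`."""
    counts = model.get(word.lower(), Counter())
    return [w for w, _ in counts.most_common(k)]

corpus = [
    "natural language processing",
    "natural language generation",
    "natural language processing is fun",
]
model = train_bigrams(corpus)
print(predict_next(model, "language"))
```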

Major Task Areas

NLG + Automatic Summarization

• Generating text from data

• Extractive summarization

• Abstractive summarization
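Extractive summarization selects existing sentences rather than writing new ones. A minimal sketch, assuming a simple scoring rule (average word frequency per sentence) and a hypothetical `summarize` function, might look like this. Abstractive summarization, by contrast, generates new text and is not attempted here.

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Keep the n highest-scoring sentences, in their original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        toks = re.findall(r"\w+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in toks) / len(toks) if toks else 0.0

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]
```

Because no stop words are removed in this sketch, very common words such as "is" inflate the scores; filtering them first would usually give better results.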

Major Task Areas

Machine Translation

• From source to target, and back!

• Single terms work... sometimes

• Idioms, metaphors, cultural references

Major Task Areas

Sentiment Analysis

• Polarity, intensity, direction

• "Easy" for movie/product reviews

• "Impossible" for nearly anything else
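To make the polarity idea concrete, here is a deliberately naive lexicon-based sketch. The `POSITIVE` and `NEGATIVE` word sets and the `polarity` function are assumptions invented for this example. It handles plain review-style sentences, but it misreads sarcasm and negation, which hints at why sentiment is so much harder outside of reviews.

```python
# Tiny illustrative sentiment lexicons (assumptions; real lexicons are far larger).
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}

def polarity(text):
    """Classify text as positive, negative, or neutral by counting lexicon hits."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this great movie!"))          # review-style text
print(polarity("Yeah, great. Just what I needed."))  # sarcasm is scored as positive
```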

Protips

• Domain adaptation (retrain your models, social media != news)

• Assume everything is in beta (error rates compound, translate last, consult the research literature)

• Evaluation is essential (human judges, “gold standard” data, cross-validation, appropriate metrics)
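Two of the tips above lend themselves to quick arithmetic. The sketch below shows how per-stage accuracy compounds across a pipeline (under the simplifying assumption that stage errors are independent) and how precision, recall, and F1 are computed against gold standard data. The function names and example figures are illustrative, not taken from the talk.

```python
def pipeline_accuracy(stage_accuracies):
    """Overall accuracy if every stage must be right and errors are independent."""
    result = 1.0
    for acc in stage_accuracies:
        result *= acc
    return result

def precision_recall_f1(predicted, gold):
    """Compare a set of predicted items against a gold standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Three stages at 90% each leave roughly 73% end to end.
print(pipeline_accuracy([0.9, 0.9, 0.9]))
```

This is one reason to put lossy steps such as machine translation as late in the pipeline as possible.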

Resources (toolkits)

Stanford CoreNLP: Java, GPL

Apache OpenNLP: Java, Apache License

NLTK: Python, Apache License

Resources (books)

Natural Language Processing with Python: Bird, Klein, and Loper

Speech and Language Processing: Jurafsky and Martin

Foundations of Statistical Natural Language Processing: Manning and Schütze

Resources (groups)

ACL (Association for Computational Linguistics): Conferences, Workshops, Journals, SIGs

DC NLP: NLP Meetups

Data Community DC: NLP Workshops

Questions?

Charlie Greenbacker

dcnlp.org

@greenbacker