Engineering Intelligent NLP Applications Using Deep Learning – Part 1

Transcript of Engineering Intelligent NLP Applications Using Deep Learning – Part 1

Page 1: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

Engineering Intelligent NLP Applications Using Deep Learning – Part 1

Saurabh Kaushik

Page 2: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

Agenda

• Part 1:
• Why NLP?
• What is NLP?
• What is Word & Sentence Modelling in NLP?
• What is Word Representation in NLP?
• What is Language Modelling in NLP?

• Part 2:
• Why DL for NLP?
• What is DL?
• What is DL for NLP?
• How does RNN work for NLP?
• How does CNN work for NLP?

Page 3: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

WHY NLP?

Page 4: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

What are the Generally Known NLP Applications?

Search

Customer Support Q & A

Summarization

Page 5: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

Are there Deeper Applications of NLP?

Group 1:
• Cleanup, tokenization
• Stemming
• Lemmatization
• Part-of-speech tagging
• Query expansion
• Parsing
• Topic segmentation and recognition
• Morphological segmentation (words/sentences)

Group 2: Information Retrieval and Extraction (IR)
• Relationship extraction
• Named entity recognition (NER)
• Sentiment analysis / Sentence boundary disambiguation
• Word sense disambiguation
• Text similarity
• Coreference resolution
• Discourse analysis

Group 3:
• Machine translation
• Automatic summarization / Paraphrasing
• Natural language generation
• Reasoning over knowledge bases
• Question answering systems
• Dialog systems
• Image captioning & other multimodal tasks

Page 6: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

WHAT IS NLP?

Page 7: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

What is NLP?

• According to Wikipedia:
• Natural language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages.

So far, computing devices and their interactions with humans have been two separate things. But in a truly digital world, this gap needs to be bridged by integrating human conversational understanding into intelligent apps/systems/things, in order to achieve their true potential.

Ref: https://en.wikipedia.org/wiki/Natural_language_processing

Page 8: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

Why is Language so Challenging for Computers?

• Language is ambiguous: every sentence has many possible interpretations.
• Language is productive: we will always encounter new words or new constructions.
• Language is culturally specific: the same word can have different meanings in different contexts.

Page 9: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

What is NLP Processing?

• Lexical Analysis − It involves identifying and analyzing the structure of words. The lexicon of a language is its collection of words and phrases. Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words.

• Syntactic Analysis (Parsing) − It involves analyzing the words in a sentence for grammar and arranging them in a manner that shows the relationships among the words. A sentence such as “The school goes to boy” is rejected by an English syntactic analyzer.

• Semantic Analysis − It draws the exact, dictionary meaning from the text, and the text is checked for meaningfulness. This is done by mapping syntactic structures to objects in the task domain. The semantic analyzer disregards sentences such as “hot ice-cream”. Also called compositional semantics.

• Discourse Integration − The meaning of any sentence depends upon the meaning of the sentence just before it. In addition, it also informs the meaning of the immediately succeeding sentence.

• Pragmatic Analysis − During this stage, what was said is re-interpreted as what was actually meant. It involves deriving those aspects of language which require real-world knowledge.
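
The first stages of this pipeline can be tried out in a few lines. Below is a minimal sketch using spaCy (an assumption for illustration; the slides name no library, and it requires spaCy plus its en_core_web_sm model to be installed):

    import spacy

    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("The bird pecks the grains.")
    for token in doc:
        # Lexical analysis: tokens and lemmas.
        # Syntactic analysis: part-of-speech tags and dependency arcs.
        print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)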

Page 10: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

What are the Basic Components of NLP? Example: “The bird pecks the grains”

• Grammar:
• Articles (DET) − a | an | the
• Nouns (N) − bird | birds | grain | grains
• Noun Phrase (NP) − Article + Noun | Article + Adjective + Noun = DET N | DET ADJ N
• Verbs (V) − pecks | pecking | pecked
• Verb Phrase (VP) − NP V | V NP
• Adjectives (ADJ) − beautiful | small | chirping

• POS Tagging:

• Parsing:
• S → NP VP
• NP → DET N | DET ADJ N
• VP → V NP

• Lexicon:
• DET → a | the
• ADJ → beautiful | perching
• N → bird | birds | grain | grains
• V → peck | pecks | pecking

Parse Tree:
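
As a concrete sketch, the grammar and lexicon above can be run through NLTK's chart parser (an assumption for illustration; NLTK is not named on the slide). It recovers the parse tree for the example sentence:

    import nltk

    # The slide's grammar and lexicon in NLTK's CFG notation.
    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> DET N | DET ADJ N
        VP -> V NP
        DET -> 'a' | 'the'
        ADJ -> 'beautiful' | 'perching'
        N -> 'bird' | 'birds' | 'grain' | 'grains'
        V -> 'peck' | 'pecks' | 'pecking'
    """)

    parser = nltk.ChartParser(grammar)
    # Tokens lowercased to match the lexicon entries.
    for tree in parser.parse("the bird pecks the grains".split()):
        print(tree)
    # (S (NP (DET the) (N bird)) (VP (V pecks) (NP (DET the) (N grains))))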

Page 11: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

How does NLP understand Syntax? Part-of-Speech Tagging
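
A quick illustrative sketch of POS tagging with NLTK's default tagger (assumptions: NLTK is installed with the punkt and averaged_perceptron_tagger resources downloaded; tags follow the Penn Treebank tag set):

    import nltk

    # Assumes: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
    tokens = nltk.word_tokenize("The bird pecks the grains")
    print(nltk.pos_tag(tokens))
    # Typically: [('The', 'DT'), ('bird', 'NN'), ('pecks', 'VBZ'),
    #             ('the', 'DT'), ('grains', 'NNS')]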

Page 12: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

WHAT IS WORD & SENTENCE MODELLING IN NLP?

Page 13: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

How does NLP get Word Meanings?

Word Meaning:
• What is the meaning of a word?
• Most words have many different senses.
• E.g. dog = animal or sausage?

• Polysemy:
• A lexeme is polysemous if it has different related senses.
• E.g. bank = financial institution or building

• Homonyms:
• Two lexemes are homonyms if their senses are unrelated, but they happen to have the same spelling and pronunciation.
• E.g. bank = (financial) bank or (river) bank
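
Word senses like these are catalogued in lexical databases such as WordNet. A minimal sketch via NLTK's WordNet interface (an assumption for illustration; requires the wordnet corpus to be downloaded):

    from nltk.corpus import wordnet as wn

    # Assumes: nltk.download('wordnet')
    # Each synset is one sense of the word "bank".
    for synset in wn.synsets("bank")[:3]:
        print(synset.name(), "-", synset.definition())
    # e.g. bank.n.01 - sloping land (especially the slope beside a body of water)
    #      depository_financial_institution.n.01 - a financial institution ...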

Page 14: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

How does NLP get Word Relationships?

• How are the meanings of different words related?
• Specific relations between senses:
• E.g. animal is more general than dog.
• Semantic fields:
• E.g. money is related to bank.

Word Relationships:

Symmetric relations:
• Synonyms: couch/sofa (two lemmas with the same sense)
• Antonyms: cold/hot, rise/fall, in/out (two lemmas with opposite senses)

Hierarchical relations:
• Hypernyms and hyponyms: pet/dog (the hyponym dog is more specific than the hypernym pet)
• Holonyms and meronyms: car/wheel (the meronym wheel is a part of the holonym car)
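
Continuing the WordNet sketch, these relations can be queried directly (again an illustrative assumption rather than anything prescribed by the slide):

    from nltk.corpus import wordnet as wn

    dog = wn.synset("dog.n.01")
    print(dog.hypernyms())      # more general senses, e.g. canine.n.02

    car = wn.synset("car.n.01")
    print(car.part_meronyms())  # senses for the parts that make up a car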

Page 15: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

How does NLP get Sentence Composability?

• Principle of compositionality:
• The meaning (vector) of a complex expression (sentence) is determined by:
• the meanings of its constituent expressions (words), and
• the rules (grammar) used to combine them.

• Scene Parsing:
• The meaning of a scene image is likewise a function of smaller regions,
• how they combine to form larger objects,
• and how the objects interact.

• Sentence Parsing:
• The meaning of a sentence is a function of its words,
• how they combine to form larger phrases and sentences,
• and how the words interact in a given sentence.

Page 16: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

WHAT IS WORD REPRESENTATION IN NLP?

Page 17: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

What is the Basic Linear Representation of Words? Bag of Words

Definition:
• Documents are treated as a “bag” of words or terms.
• Any document can be represented as a vector: a list of terms and their associated weights.

Pros:
• A simple model to start with

Cons:
• Disregards grammar (term.baseform)
• Disregards word order (term.position)
• Keeps only multiplicity (term.frequency)
• Less accurate

Technique: TF-IDF (term frequency – inverse document frequency)
• TF is the term frequency in a document, i.e. a measure of how much information the term carries within one document.
• IDF is the inverse document frequency of the term, i.e. an inverse measure of how much information the term carries across all documents (the corpus).
• Formula: tfidf(t, d, D) = tf(t, d) × idf(t, D), where idf(t, D) = log( |D| / |{d ∈ D : t appears in d}| )
• t is a term, d is one document, D is the set of all documents.
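
A minimal scikit-learn sketch of the bag-of-words / TF-IDF representation (an assumption for illustration; the slide names no library, and the corpus is made up):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "the bird pecks the grains",
        "the bank of the river",
        "the bank approved the loan",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)        # sparse (3 docs x vocabulary) matrix
    print(vectorizer.get_feature_names_out())
    print(X.toarray().round(2))               # one TF-IDF weighted vector per doc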

Page 18: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

What is Distributed Representation?

Statistical modeling (e.g. BOW):
• Word-ordering information is lost
• Data sparsity
• Words are atomic symbols
• Very hard to find higher-level features
• Hard to add features other than BOW

Neural network modeling:
• Trained in a completely unsupervised way
• Reduces data sparsity
• Semantic hashing
• The vectors appear to carry semantic information about the words
• Freely available for out-of-the-box usage

Distributional hypothesis: linguistic items with similar distributions have similar meanings. Generally, this is based on co-occurrence/context, with distributional meaning represented as a co-occurrence vector.

Page 19: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

What is One-Hot Encoding?

Definition:
• The vast majority of rule-based and statistical NLP work regards words as atomic symbols.
• Form a vocabulary of words that maps lemmatized words to a unique ID (the position of the word in the vocabulary).
• Typical vocabulary sizes vary between 10,000 and 250,000.
• The one-hot vector of an ID is a vector filled with 0s, except for a 1 at the position associated with the ID.
• E.g., for vocabulary size D=10, the one-hot vector of word ID w=4 is e(w) = [0 0 0 1 0 0 0 0 0 0].
• A one-hot encoding makes no assumption about word similarity; all words are equally different from each other.

Pros:
• Simplicity

Cons:
• The notion of word similarity is undefined with one-hot encoding:
social [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
public [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
• Impossible to generalize to unseen words
• One-hot encoding can be memory inefficient
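
A minimal numpy sketch of the scheme above (the vocabulary here is hypothetical):

    import numpy as np

    vocab = {"the": 0, "bird": 1, "pecks": 2, "grains": 3}

    def one_hot(word):
        # All zeros except a 1 at the word's vocabulary index.
        v = np.zeros(len(vocab))
        v[vocab[word]] = 1.0
        return v

    print(one_hot("bird"))                      # [0. 1. 0. 0.]
    print(one_hot("bird") @ one_hot("grains"))  # 0.0: every pair of distinct
                                                # words is "equally different"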

Page 20: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

What is Word Embedding?

• One of the most successful ideas of modern statistical NLP!
• “You shall know a word by the company it keeps” (J. R. Firth, 1957). In the slide's example, the words surrounding “banking” come to represent its meaning.

Definition:
• Helps to find syntactic as well as semantic similarity.

Pros:
• Simplicity
• Possible to generalize to unseen words

Cons:
• All words are equal, but some words are more equal than others.

Page 21: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

What is Word Embedding? Cosine Similarity

Vector Representation:
• Map each document in a corpus to an n-dimensional vector, where n is the size of the vocabulary.
• Represent each unique word as a dimension; the magnitude along this dimension is the count of that word in the document.
• Given such vectors a, b, ..., we can compute the vector dot product and the cosine of the angle between them: cos(a, b) = (a · b) / (|a| |b|).
• The angle is a measure of alignment between two vectors, and hence of similarity.
• An example of its use in information retrieval: vectorize both the query string q and the documents d1, ..., dn, and compute similarity(q, di) for all i from 1 to n.

Word2Vec Vector for “Sweden”
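
A minimal numpy sketch of the cosine-similarity computation (the vectors are made-up counts):

    import numpy as np

    def cosine_similarity(a, b):
        # cos(a, b) = (a . b) / (|a| * |b|)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    query = np.array([1.0, 0.0, 2.0, 0.0])  # hypothetical query count vector
    doc   = np.array([2.0, 1.0, 3.0, 0.0])  # hypothetical document vector
    print(cosine_similarity(query, doc))    # near 1.0 => well aligned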

Page 22: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

What is Word Embedding? A classical example shows how vectors can help a computer understand the semantic relationships between the words of a language.
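
The classical demonstration is analogy arithmetic on pretrained vectors: king - man + woman ≈ queen. A sketch using gensim (assumptions: gensim is installed and the pretrained word2vec-google-news-300 vectors are used; the slide does not specify a model, and the download is large):

    import gensim.downloader as api

    # Downloads the pretrained vectors on first use (~1.6 GB).
    model = api.load("word2vec-google-news-300")
    print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # [('queen', ...)]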

Page 23: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

WHAT IS LANGUAGE MODELING IN NLP?

Page 24: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

What is Language Modeling?

• A language model is a probabilistic model that assigns a probability to any sequence of words, p(w1, ..., wT).
• By the chain rule, this joint probability factorizes as p(w1, ..., wT) = p(w1) · p(w2 | w1) · ... · p(wT | w1, ..., wT-1).
• Language modeling is the task of learning a language model that assigns high probabilities to well-formed sentences.
• It plays a crucial role in speech recognition and machine translation systems.
• There are four types of language modelling:
• Linear language modelling: addressed by finding the probability of a word appearing in a corpus.
• Statistical language modelling: addressed by finding the probability of a word in a sequence / in the presence of other words.
• Neural language modelling: addressed by understanding the context of a word from its neighbours.
• Recursive language modelling: addressed by understanding the sequence of words appearing one after another.

Page 25: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

What is Linear Language Modelling? (N-Gram)

• An n-gram is a sequence of n words:
• unigrams (n=1): “is”, “a”, “sequence”, etc.
• bigrams (n=2): [“is”, “a”], [“a”, “sequence”], etc.
• trigrams (n=3): [“is”, “a”, “sequence”], [“a”, “sequence”, “of”], etc.
• N-gram models estimate the conditional probability of a word from n-gram counts.
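
A minimal sketch of bigram counting and the maximum-likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1) (the toy corpus is made up; the estimator itself is the standard one):

    from collections import Counter

    tokens = "this is a sequence of words and this is another sequence".split()

    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def bigram_prob(w1, w2):
        # Maximum-likelihood estimate: count(w1 w2) / count(w1)
        return bigrams[(w1, w2)] / unigrams[w1]

    print(bigram_prob("this", "is"))  # 1.0: "this" is always followed by "is"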

Page 26: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

What is Statistical Language Modelling?

• Problem:
• How can we handle the co-occurrence of words in our models?

• Solution:
• Using probabilistic modeling, any co-occurrence of words can be modelled.
• A language model is a probabilistic model that assigns probabilities to any sequence of words p(w1, ..., wT).
• Language modeling is the task of learning a language model that assigns high probabilities to well-formed sentences.
• It plays a crucial role in speech recognition and machine translation systems.
• Language models define probability distributions over (natural language) strings or sentences.
• They are built from joint and conditional probabilities, e.g. p(w1, w2) = p(w1) · p(w2 | w1).

Page 27: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

What is Neural Language Modelling?

• Problem:
• How can we handle the context of language in our models?

• Solution:
• Neural networks can theoretically (given enough units) approximate “any” function and fit “any” kind of data.
• They are efficient for NLP: hidden layers can be used as word lookup tables.
• Dense distributed word vectors + efficient NN training algorithms can scale to billions of words!
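
A minimal numpy sketch of the “hidden layer as word lookup table” idea: an embedding matrix turns word IDs into dense vectors, and a small network scores the next word (weights are random, purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    V, d, h = 10, 4, 8                  # vocabulary, embedding, hidden sizes

    E = rng.normal(size=(V, d))         # embedding matrix: one row per word ID
    W = rng.normal(size=(2 * d, h))     # hidden layer over a two-word context
    U = rng.normal(size=(h, V))         # output layer scoring every word

    context = [3, 7]                             # IDs of the two previous words
    x = np.concatenate([E[i] for i in context])  # "lookup" = row selection
    hidden = np.tanh(x @ W)
    logits = hidden @ U
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the next word
    print(probs.argmax(), probs.max())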

Page 28: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

What is Recursive Language Modelling?

• Problem:
• How do we handle the compositionality of language in our models?

• Solution:
• Recursion: the same operator (the same parameters) is applied repeatedly on different components. Applied along a word sequence, this is also called a Recurrent Neural Network (RNN).

Recursive Neural Networks (RNN)
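
A minimal numpy sketch of the recurrence h_t = tanh(x_t · W_xh + h_(t-1) · W_hh), where the same parameters are applied at every step (random weights and inputs, purely illustrative; RNNs are covered properly in Part 2):

    import numpy as np

    rng = np.random.default_rng(0)
    d, h = 4, 8                        # input (word vector) dim, hidden dim

    W_xh = rng.normal(size=(d, h))     # the same parameters are reused
    W_hh = rng.normal(size=(h, h))     # at every step of the sequence

    state = np.zeros(h)
    for x in rng.normal(size=(5, d)):  # five word vectors in sequence
        state = np.tanh(x @ W_xh + state @ W_hh)
    print(state.round(2))              # final state summarizes the sequence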

Page 29: Engineering Intelligent NLP Applications Using Deep Learning – Part 1

Thank You

Saurabh Kaushik