ASR, NLU, and chatbots
Pierre Lison
IN4080: Natural Language Processing (Fall 2019)
07.11.2019
Plan for today
► Automatic Speech Recognition (ASR)
► Natural Language Understanding (NLU)
► Chatbot models
Speech recognition

Where ASR fits in a spoken dialogue system:

User → input speech signal (user utterance) → Speech recognition → recognition hypotheses → Language understanding → interpreted dialogue act → Dialogue management → intended response → Generation → utterance to synthesise → Speech synthesis → output speech signal (machine utterance) → User
A difficult problem!
The speech chain

[Denes and Pinson (1993), «The Speech Chain»]
Speech production
► Sounds are variations in air pressure
► How are they produced?
   ▪ An air supply: the lungs (we usually speak by breathing out)
   ▪ A sound source setting the air in motion (e.g. vibrating) in ways relevant to speech production: the larynx, in which the vocal folds are located
   ▪ A set of three filters modulating the sound: the pharynx, the oral tract (teeth, tongue, palate, lips, etc.) and the nasal tract
Speech production

Visualisation of the vocal tract via magnetic resonance imaging (MRI)

NB: a few languages also rely on sounds produced with a different airstream mechanism, such as the click consonants found in the Khoisan languages of southern Africa
Speech perception

► A (speech) sound is a variation of air pressure
   ▪ This variation originates from the speaker's speech organs
   ▪ We can plot a wave showing the changes in air pressure over time (the zero value being the normal air pressure)

Example: zooming in on the part of a waveform between 1.126 and 1.157 s, we observe about 4 cycles in 0.031 s, which means a frequency of about 4/0.031 ≈ 129 Hz
Important measures

1. The fundamental frequency F0: the lowest frequency of the sound wave, corresponding to the speed of vibration of the vocal folds (roughly 85–180 Hz for male voices and 165–255 Hz for female voices)

2. The intensity: the signal power normalised to the human auditory threshold, measured in dB (decibels):

   Intensity = 10 · log10( (1 / (N · P0²)) · Σ_{i=1..N} x_i² )

   for a sample of N time points t1, ..., tN with amplitudes x1, ..., xN, where P0 = 2 × 10⁻⁵ Pa is the human auditory threshold

Note: the dB scale is logarithmic, not linear!
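As a rough illustration of the dB formula above, here is a minimal sketch (function and variable names are my own, not from the slides):

```python
import math

# Human auditory threshold in Pascal (from the slide)
P0 = 2e-5

def intensity_db(samples):
    """10 * log10 of the mean squared pressure, normalised by P0^2."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10 * math.log10(mean_square / (P0 ** 2))

# A constant-amplitude signal at 0.02 Pa is 60 dB above threshold:
# 10 * log10((0.02 / 2e-5)^2) = 20 * log10(1000) = 60 dB
print(round(intensity_db([0.02, -0.02, 0.02, -0.02]), 1))
```

Note how the logarithmic scale shows up: doubling the amplitude adds about 6 dB rather than doubling the value.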
Why are F0 and the intensity important?

► F0 correlates with the pitch of the voice, and the pitch movement over an utterance gives us its intonation
   ▪ "The ball is red" vs. "Is the ball red?"
   ▪ Interrogative utterance = rising intonation at the end

► The signal intensity corresponds to the loudness of the speech sound (low intensity vs. high intensity)
Speech recognition task

► Goal: map the speech signal into a sequence of linguistic symbols (words or characters)

► Given a sequence of acoustic observations O = o1, o2, o3, ..., om (e.g. one every 10 milliseconds)
► We wish to determine the (hidden) sequence of words W = w1, w2, w3, ..., wn

► Many sources of variation: speaker voice (and style), accents, ambient noise, etc.
Classical model

► We search for the most likely word sequence Ŵ = argmax_W P(W | O)

► Using Bayes' rule, we can rewrite Ŵ as:

   Ŵ = argmax_W P(O | W) · P(W) / P(O)   (Bayes)
     = argmax_W P(O | W) · P(W)          (P(O) constant for all W)

► Acoustic model P(O | W): determines the probability of the acoustic inputs O given the word sequence W
► Language model P(W): determines the probability of the word sequence W
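A toy sketch of this noisy-channel decoding, with made-up log-probabilities for two candidate transcriptions (the classic «recognize speech» / «wreck a nice beach» confusion pair):

```python
# Illustrative log-probabilities, invented for this sketch (not real models).
acoustic_logprob = {            # log P(O | W): how well W explains the audio
    "recognize speech": -12.0,
    "wreck a nice beach": -11.5,
}
lm_logprob = {                  # log P(W): prior plausibility of W
    "recognize speech": -4.0,
    "wreck a nice beach": -9.0,
}

def decode(candidates):
    """Return argmax_W [log P(O|W) + log P(W)] (Bayes' rule, P(O) dropped)."""
    return max(candidates, key=lambda w: acoustic_logprob[w] + lm_logprob[w])

# The acoustic model slightly prefers the wrong string, but the
# language model tips the balance towards the plausible one.
print(decode(list(acoustic_logprob)))
```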
Modern neural models

► The best-performing ASR systems are deep, end-to-end neural architectures
   ▪ Less dependent on external resources (such as pronunciation dictionaries)
   ▪ No need to preprocess the acoustic inputs

► Too complex / time-demanding to review them in this course
   ▪ But they rely on the same building blocks as other NNs: convolutions, recurrence, (self-)attention, etc.
ASR performance

[Figure from Bhuvana Ramabhadran's presentation at Interspeech 2018]
ASR evaluation

► Standard evaluation metric: Word Error Rate (WER)
   ▪ Measures how much the utterance hypothesis h differs from the «gold standard» transcription t*

► Relies on a minimum edit distance between h and t*, counting the number of word substitutions, insertions and deletions:

   WER = (substitutions + insertions + deletions) / (number of words in t*)
ASR evaluation

Example 1:
   Gold standard:  yes can you now rotate this triangle
   ASR hypothesis: yes can you not rotate this triangle there
   → 1 substitution + 1 insertion

Example 2:
   Gold standard:  there is five and
   ASR hypothesis: the size and
   → 2 substitutions + 1 deletion
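The WER computations above can be reproduced with a standard minimum-edit-distance dynamic program; a sketch (normalising by the reference length, the usual convention):

```python
def wer(ref, hyp):
    """Word Error Rate = min edit distance (sub/ins/del) / number of ref words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn r[:i] into h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(r)][len(h)] / len(r)

# Example 1: 1 sub + 1 ins over 7 reference words -> 2/7
print(wer("yes can you now rotate this triangle",
          "yes can you not rotate this triangle there"))
# Example 2: 2 sub + 1 del over 4 reference words -> 3/4
print(wer("there is five and", "the size and"))
```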
Disfluencies

► Speakers construct their utterances «as they go», incrementally
   ▪ This incremental production leaves a trace in the speech stream

► Presence of multiple disfluencies:
   ▪ Pauses and fillers («øh», «um», «liksom» ≈ English "like")
   ▪ Fragments
   ▪ Repetitions («the the ball»), corrections («the ball err mug»), repairs («the bu/ ball»)
Disfluencies

Internal structure of a disfluency:
► reparandum: part of the utterance which is edited out
► interregnum: (optional) filler
► repair: part meant to replace the reparandum

[Shriberg (1994), «Preliminaries to a Theory of Speech Disfluencies», Ph.D. thesis]
Disfluencies

► Repetitions
► Corrections
► Rephrasing/completion

More complex disfluencies, from spoken Norwegian (with rough English glosses):

► «så gikk jeg e flytta vi til Nesøya da begynte jeg på barneskolen der og så har jeg gått på Landøya ungdomsskole # som ligger ## rett over broa nesten # rett med Holmen»
   (roughly: "then I went uh we moved to Nesøya, then I started primary school there, and then I attended Landøya lower secondary school # which is ## almost right across the bridge # right by Holmen")

► «jeg gikk på Bryn e skole som lå rett ved der vi bodde den gangen e barneskole videre på Hauger ungdomsskole»
   (roughly: "I went to Bryn uh school which was right next to where we lived back then uh primary school, then on to Hauger lower secondary school")

► «da hadde alle hele på skolen skulle liksom # spise julegrøt og det va- det var bare en mandel og da var jeg som fikk den da ble skikkelig sånn " wow # jeg har fått den " ble så glad»
   (roughly: "then everyone the whole school was like # going to eat Christmas porridge and it wa- there was only one almond and it was me who got it, then [I] went totally 'wow # I got it', [I] was so happy")

[«Norske talespråkskorpus – Oslo delen» (NoTa), collected and annotated by the Tekstlaboratoriet]
Disfluency detection

► We can build a neural network to automatically detect disfluencies
   ▪ Sequence labelling task: determine for each token whether it is disfluent or not

► Open question: what should we do with the result? Edit out the disfluencies?

[Paria Jamshid Lou et al. (2018), «Disfluency detection using auto-correlational neural networks», EMNLP]
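To make the sequence-labelling framing concrete, here is a trivial baseline that only flags immediate repetitions; it is emphatically not the neural model from the cited paper, and the D/F tag names are my own:

```python
def label_repetitions(tokens):
    """Toy disfluency labeller: tag a token 'D' (disfluent) when it is
    immediately repeated by the next token, else 'F' (fluent)."""
    labels = []
    for i, tok in enumerate(tokens):
        is_repeated = i + 1 < len(tokens) and tokens[i + 1] == tok
        labels.append("D" if is_repeated else "F")
    return labels

# The reparandum «the» in «the the ball» gets tagged as disfluent:
print(label_repetitions("the the ball".split()))
```

A real system would of course also need to handle corrections and repairs, where the reparandum and repair differ.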
Plan for today
► Automatic Speech Recognition (ASR)
► Natural Language Understanding (NLU)
► Chatbot models
NLU

► The goal of NLU is to determine the content (or intent) of an utterance

► The output may be:
   ▪ A categorical label (or a set of labels)
   ▪ A list of recognised slots

«Show me morning flights from Boston to San Francisco on Tuesday»
Intent classification

► Can be framed as a text classification task
   ▪ Requires dialogue data annotated with intents
   ▪ Categories may be derived from domain-independent taxonomies (e.g. dialogue acts)
   ▪ ... or ad-hoc taxonomies for your domain

► Pick your favourite machine learning model

[Diagram: the tokens «How are you ?» fed through an LSTM, with a softmax over the final state producing the output label]
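A minimal sketch of intent classification as text classification, using a hand-set bag-of-words linear model with a softmax instead of a trained LSTM (the intents and weights are invented for illustration):

```python
import math

INTENTS = ["greeting", "book_flight"]
WEIGHTS = {                       # word -> one score per intent (hand-set)
    "hi":     [2.0, 0.0],
    "hello":  [2.0, 0.0],
    "flight": [0.0, 2.0],
    "book":   [0.0, 2.0],
}

def classify(utterance):
    """Sum per-word scores, then softmax over intents."""
    scores = [0.0] * len(INTENTS)
    for word in utterance.lower().split():
        for k, w in enumerate(WEIGHTS.get(word, [0.0] * len(INTENTS))):
            scores[k] += w
    exps = [math.exp(s) for s in scores]
    probs = [e / sum(exps) for e in exps]
    k = probs.index(max(probs))
    return INTENTS[k], probs[k]

print(classify("book a flight to Boston"))
```

In practice the word scores would be learned from annotated dialogue data rather than written by hand.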
Intent classification

► How to take the context into account?
   ▪ Simple approach: prepend the previous dialogue history (up to a limit) to the input:
     Hi Pierre! <turn> Hi Alex! <turn> How are you?
   ▪ More advanced: view intent classification as a sequence labelling task (at the utterance level) and find the most likely sequence of intents for a given dialogue
Slot filling

► Goal: find the slots with a user-provided value in the utterance

► The slots are domain-specific
   ▪ And so are the ontologies listing all possible values for each slot

«Show me morning flights from Boston to San Francisco on Tuesday»
Slot filling

Popular approach: use (again) a sequence-labelling approach, for instance with BIO tags

[Illustration from D. Jurafsky]
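Once a sequence labeller has produced BIO tags, they still have to be grouped into slot–value pairs; a sketch of that decoding step, with slot names I invented for the flight example:

```python
def bio_to_slots(tokens, tags):
    """Group BIO tags into (slot, value) pairs, joining B-/I- spans."""
    slots, current, value = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                       # close any open span
                slots.append((current, " ".join(value)))
            current, value = tag[2:], [tok]
        elif tag.startswith("I-") and current == tag[2:]:
            value.append(tok)                 # continue the open span
        else:                                 # an 'O' tag closes the span
            if current:
                slots.append((current, " ".join(value)))
            current, value = None, []
    if current:
        slots.append((current, " ".join(value)))
    return slots

tokens = "from Boston to San Francisco on Tuesday".split()
tags = ["O", "B-origin", "O", "B-dest", "I-dest", "O", "B-depart_date"]
print(bio_to_slots(tokens, tags))
```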
Reference resolution

► Another important NLU task (especially in situated systems) is reference resolution
   ▪ e.g. «Pick up the box on your left»

► Reference resolution is the process of finding which entities are referred to by specific linguistic expressions
Reference resolution

Some terminology:
► A linguistic expression used to perform reference is called a referring expression
► The entity that is referred to is called the referent

Example: «Pierre» and «the IN4080 teacher» are two referring expressions for the same referent (coreference); an expression compatible with several possible referents exhibits referential ambiguity
Reference resolution

► Reference resolution usually relies on a discourse model containing the set of entities that can be referred to
   ▪ As well as their relationships with one another
   ▪ The discourse model continuously changes during the interaction (entities come and go, become more or less focused, etc.)

► In situated systems, the discourse model also contains objects or events in the shared environment
Reference resolution

► Various features can be used to resolve references:
   ▪ Grammatical agreement (number, person, gender)
   ▪ Saliency (recency of mention, visual salience, etc.)
   ▪ Semantic constraints

► Based on these features and annotated training data, one can then train a classifier
   ▪ Binary classification: given a referring expression A and a candidate referent B, the classifier determines whether A refers to B
   ▪ Any supervised learning algorithm will do
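A sketch of the feature-extraction side of such a mention-pair classifier; the pronoun inventory, feature names and entity records are all invented for illustration, and a real system would feed these features into a trained classifier:

```python
# Tiny inventory of referring expressions with agreement features
PRONOUNS = {
    "he":   {"number": "sg", "gender": "m"},
    "she":  {"number": "sg", "gender": "f"},
    "they": {"number": "pl", "gender": None},   # underspecified gender
}

def pair_features(expression, candidate):
    """Features for one (referring expression, candidate referent) pair:
    grammatical agreement plus a saliency cue (recency of mention)."""
    p = PRONOUNS[expression]
    return {
        "number_agree": p["number"] == candidate["number"],
        "gender_agree": p["gender"] in (None, candidate["gender"]),
        "recency": candidate["distance"],       # how many mentions back
    }

pierre = {"number": "sg", "gender": "m", "distance": 1}
print(pair_features("he", pierre))
```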
Final remarks on NLU

► How to extract meaning out of each utterance is often highly domain-dependent
   ▪ Which information is relevant for your system?
   ▪ What kind of data do you have?

► Rule-based approaches are still quite popular
   ▪ As are hybrid approaches combining rules with ML

► Alternatively: use neural models to generate utterance-level embeddings instead of predefined categories (intents, slots, etc.)
Plan for today
► Automatic Speech Recognition (ASR)
► Natural Language Understanding (NLU)
► Chatbot models
Chatbots: main approaches

► Rule-based models
   ▪ Pro: fine-grained control over the interaction
   ▪ Con: difficult to build, scale and maintain

► Corpus-based models (Information-Retrieval models, Sequence-to-Sequence models)
   ▪ Pro: better coverage & robustness
   ▪ Con: need training data!
Rule-based models

► Pattern–action rules
► For instance: [example from D. Jurafsky]
IR models

► Alternatively, one can adopt a data-driven approach and learn how to respond to the user from a dialogue corpus

► Key idea:
   ▪ Given a user input q, find the utterance t in the dialogue corpus that is most similar to q
   ▪ Then return as response the utterance r following t in the corpus
IR models

► How to determine which utterance is «most similar» to the actual user utterance?
   ▪ Cosine similarity over some vector representations
   ▪ The vectors can be TF-IDF weighted word counts
   ▪ Or utterance-level embeddings
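A minimal IR-chatbot sketch using cosine similarity over plain term-count vectors (TF-IDF weighting is left out for brevity; the toy corpus is invented):

```python
import math
from collections import Counter

# Toy dialogue corpus: each utterance is followed by its response
corpus = [
    "hello there",
    "hi how can i help you",
    "what time do you open",
    "we open at nine",
]

def cosine(a, b):
    """Cosine similarity between two utterances as term-count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def respond(query):
    """Find the corpus utterance t most similar to the query,
    and return the utterance r that follows it."""
    best = max(range(len(corpus) - 1), key=lambda i: cosine(query, corpus[i]))
    return corpus[best + 1]

print(respond("what time do you open please"))
```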
Dual encoders

► Training data: (input, response) pairs labelled with 1 (correct response) or 0 (wrong response)
► Score: dot product between the input embedding and the response embedding (after a linear transform), followed by a sigmoid to get a score in [0, 1]
► At runtime, search for the response with the maximum output score

[Lison & Bibauw (2017), «Not all dialogues are created equal: Instance Weighting for neural conversational models», SIGDIAL]
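A sketch of the scoring step only, with invented 3-dimensional embeddings standing in for the two encoder outputs (a real dual encoder learns the encoders, and the linear transform omitted here, from labelled pairs):

```python
import math

def score(input_emb, response_emb):
    """Dot product between input and response embeddings,
    squashed by a sigmoid into a score in [0, 1]."""
    dot = sum(x * y for x, y in zip(input_emb, response_emb))
    return 1 / (1 + math.exp(-dot))

query = [1.0, 0.5, -0.2]                       # stand-in input embedding
responses = {
    "good, thanks!":      [0.9, 0.6, -0.1],    # stand-in response embeddings
    "paris is in france": [-0.8, 0.1, 0.7],
}

# At runtime: return the response with the maximum output score
best = max(responses, key=lambda r: score(query, responses[r]))
print(best)
```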
Seq2seq models

► Sequence-to-sequence models generate a response token by token
   ▪ Akin to machine translation
   ▪ Advantage: can generate «creative» responses not observed in the corpus

► Two steps:
   ▪ First «encode» the input with e.g. an LSTM
   ▪ Then «decode» the output token by token

[Image borrowed from «Deep Learning for Chatbots: Part 1»]

NB: state-of-the-art seq2seq models use an attention mechanism (not shown here) above the recurrent layer
Seq2seq models

► Interesting models for dialogue research

► But:
   ▪ Difficult to «control» (hard to know in advance what the system may generate)
   ▪ Lack of diversity in the responses (they often stick to generic answers: «I don't know», etc.)
   ▪ Getting a seq2seq model to work reasonably well takes a lot of time (and data)

[Li, Jiwei, et al. (2015), «A diversity-promoting objective function for neural conversation models», ACL]
Plan for today
► Automatic Speech Recognition (ASR)
► Natural Language Understanding (NLU)
► Chatbot models
► Short recap
Summary

► Deep NNs have boosted ASR performance
   ▪ But ASR is not yet a «solved problem» (especially for resource-poor languages and non-standard voices/acoustic environments)
   ▪ The Word Error Rate metric is used for evaluation

► Disfluencies abound in spoken language

► ASR: map the acoustic observations O = o1, o2, o3, ..., om to a recognition hypothesis W = w1, w2, w3, ..., wn
Summary

► Natural language understanding (NLU) is an umbrella term for models designed to extract «content» from utterances
   ▪ Intent recognition
   ▪ Slot filling
   ▪ Reference resolution

► Rule-based or ML-based approaches
   ▪ Or a combination of both!
Summary

► Chatbots can be either:
   ▪ Handcrafted, with pattern–response rules
   ▪ Data-driven, based on a dialogue corpus

► IR models: find the «best» response among the ones in the corpus
► Seq2seq models: generate a response token by token