Using Dialog Corpora to train a Chatbot

Bayan Abu Shawar and Eric Atwell, University of Leeds

The paper presents the following:

ALICE and Elizabeth chatbot systems.

Examples of the Dialogue Diversity Corpus and its problems.

A Java program to convert from dialogue transcript to AIML Format.

Using Wmatrix to compare human and chatbot dialogue.

A Chatbot

A chatbot is a conversational agent that interacts with users in natural language.

The ALICE and Elizabeth chatbots are presented in this paper.

Both were adapted from ELIZA (Weizenbaum 1966), which emulated a psychotherapist.

ALICE System

 

ALICE: the Artificial Linguistic Internet Computer Entity, a software robot that you can chat with using natural language.

ALICE's language knowledge is stored in AIML files.

AIML: the Artificial Intelligence Markup Language.

AIML files are made up of:

Topics: each topic contains a list of categories.

Categories: each category contains
    Pattern: matched against the user input.
    Template: represents the ALICE output.

Patterns can match parts of the input: "divide and conquer".
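As a rough illustration of how a category pairs a pattern with a template, the Java sketch below matches normalized input against patterns in which `*` stands for one or more words. The names `CategoryMatcher`, `normalize` and `respond` are hypothetical, and the first-match linear scan is a simplification assumed for this sketch; the slides do not describe ALICE's actual matching algorithm.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: a category is a (pattern, template) pair.
final class CategoryMatcher {
    private final Map<String, String> categories = new LinkedHashMap<>();

    void add(String pattern, String template) {
        categories.put(pattern, template);
    }

    // Upper-case the input and replace punctuation with spaces (assumed normalization).
    static String normalize(String input) {
        return input.toUpperCase(Locale.ROOT)
                .replaceAll("[^A-Z0-9 ]", " ")
                .trim()
                .replaceAll("\\s+", " ");
    }

    // Turn a pattern into a regex; '*' matches one or more words.
    static Pattern toRegex(String pattern) {
        StringBuilder sb = new StringBuilder();
        for (String word : pattern.split(" ")) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(word.equals("*") ? ".+" : Pattern.quote(word));
        }
        return Pattern.compile(sb.toString());
    }

    // Return the template of the first matching category, or null if none matches.
    String respond(String input) {
        String text = normalize(input);
        for (Map.Entry<String, String> e : categories.entrySet()) {
            if (toRegex(e.getKey()).matcher(text).matches()) {
                return e.getValue();
            }
        }
        return null;
    }
}
```

Categories are tried in insertion order here, so more specific patterns should be added before more general ones.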

The AIML Format

<aiml version="1.0">
  <topic name="the topic">
    <category>
      <pattern>PATTERN</pattern>
      <template>Template</template>
    </category>
    ..
  </topic>
</aiml>
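A file in this format can be read with the XML parser that ships with the JDK. The sketch below is only an illustration: the class name `AimlReader` is hypothetical, and it assumes each category holds exactly one plain-text pattern and one plain-text template, ignoring topic names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: extract pattern -> template pairs from AIML text.
final class AimlReader {
    static Map<String, String> readCategories(String aimlXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(aimlXml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> result = new LinkedHashMap<>();
            NodeList categories = doc.getElementsByTagName("category");
            for (int i = 0; i < categories.getLength(); i++) {
                Element category = (Element) categories.item(i);
                // Assumes one <pattern> and one <template> per category.
                String pattern = category.getElementsByTagName("pattern")
                        .item(0).getTextContent().trim();
                String template = category.getElementsByTagName("template")
                        .item(0).getTextContent().trim();
                result.put(pattern, template);
            }
            return result;
        } catch (Exception e) {
            // Malformed XML or I/O problems are reported as a bad argument.
            throw new IllegalArgumentException("Could not parse AIML", e);
        }
    }
}
```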

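Example involving <srai> (recursion):

User input: Halo, what is 2 and 2 ?

Normalized input: HALO WHAT IS 2 AND 2

[Diagram: matching of the normalized input. It shows the patterns HALO WHAT IS 2 AND * and HELLO WHAT IS 2 *, the template <sr/> <srai>WHAT IS 2 AND 2 </srai>, the greetings "Well hello there!", "Hi. I was waiting to talk" and "Hello there!", and the answers "Two", "Four" and "Six".]

Response: Hello there! Four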
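The recursion can be sketched in Java as follows. This is not a reconstruction of the slide's diagram: the three categories in the usage note are hypothetical, chosen only so that the final response is the one shown above. The sketch relies on the AIML convention that `<sr/>` abbreviates `<srai><star/></srai>`, supports only a trailing `*`, and has no guard against categories that recurse forever.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of <srai> recursion over a tiny category table.
final class SraiDemo {
    private static final Pattern SRAI = Pattern.compile("<srai>(.*?)</srai>");
    private final Map<String, String> categories = new LinkedHashMap<>();

    void add(String pattern, String template) {
        categories.put(pattern, template);
    }

    // Find the first category that matches and expand its template.
    String respond(String input) {
        for (Map.Entry<String, String> e : categories.entrySet()) {
            String pattern = e.getKey();
            String star = "";
            if (pattern.endsWith(" *")) {
                // Keep the trailing space so "HALO *" needs at least one more word.
                String prefix = pattern.substring(0, pattern.length() - 1);
                if (!input.startsWith(prefix) || input.length() == prefix.length()) {
                    continue;
                }
                star = input.substring(prefix.length());
            } else if (!pattern.equals(input)) {
                continue;
            }
            return expand(e.getValue(), star);
        }
        return "";
    }

    // Replace <sr/> by <srai>star</srai>, then answer each <srai> recursively.
    private String expand(String template, String star) {
        String t = template.replace("<sr/>", "<srai>" + star + "</srai>");
        Matcher m = SRAI.matcher(t);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String sub = respond(m.group(1).trim());
            m.appendReplacement(out, Matcher.quoteReplacement(sub));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With the hypothetical categories `HALO *` (template `<srai>HELLO</srai> <sr/>`), `HELLO` (template `Hello there!`) and `WHAT IS 2 AND 2` (template `Four`), the input `HALO WHAT IS 2 AND 2` is split into a greeting and a question, each answered by its own category.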

Elizabeth system (Millican 2002)

Knowledge is stored as a script in a text file.

Each line starts with a script command notation.

These notations are:

W: Welcome message
Q: Quitting message
V: Void input
I: Input transformation
K: Keyword pattern
R: Keyword response
N: No match
O: Output transformation
M: Memorise phrase
&: Action to be performed
/: Comment
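A script reader could start by splitting each line into its command character and the remaining text. The Java sketch below assumes that the command is the first non-blank character and that everything after it is the command's text; the slides do not specify the exact line layout, and the class name `ElizabethScriptLine` is hypothetical.

```java
// Hypothetical helper: splits an Elizabeth script line into command and text.
final class ElizabethScriptLine {
    final char command;
    final String text;

    private ElizabethScriptLine(char command, String text) {
        this.command = command;
        this.text = text;
    }

    // Meaning of each command character as listed on the slide; null if unknown.
    static String describe(char command) {
        switch (command) {
            case 'W': return "Welcome message";
            case 'Q': return "Quitting message";
            case 'V': return "Void input";
            case 'I': return "Input transformation";
            case 'K': return "Keyword pattern";
            case 'R': return "Keyword response";
            case 'N': return "No match";
            case 'O': return "Output transformation";
            case 'M': return "Memorise phrase";
            case '&': return "Action to be performed";
            case '/': return "Comment";
            default: return null;
        }
    }

    // Assumed layout: command character first, then the text of the command.
    static ElizabethScriptLine parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || describe(trimmed.charAt(0)) == null) {
            throw new IllegalArgumentException("Not a script line: " + line);
        }
        return new ElizabethScriptLine(trimmed.charAt(0), trimmed.substring(1).trim());
    }
}
```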

Pattern Matching is more complex in Elizabeth

The matching process involves five phases:

1. Matching with input transformation rules.
2. Matching with keyword patterns.
3. Matching with output transformation rules.
4. Matching with void or no-keyword messages.
5. Performing any dynamic processes.

ALICE categories are simpler and easier to machine-learn, but AIML format can also be converted to Elizabeth script.

Example:

Input: I think my mum loves my brother more than me

Match algorithm:

1. Input transformation: I think my mother loves my brother more than me
2. Keyword pattern: WHY DO YOU THINK [my mother loves my brother more than me]?
3. Output transformation: WHY DO YOU THINK YOUR MOTHER LOVES YOUR BROTHER MORE THAN YOU?

Response: WHY DO YOU THINK YOUR MOTHER LOVES YOUR BROTHER MORE THAN YOU?
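The three steps of this example can be sketched in Java. The rule tables below (`mum` to `mother`, `my` to `your`, `me` to `you`, and the keyword `I think ...`) are hypothetical and contain only what is needed to reproduce the example; the no-match reply is a placeholder, and Elizabeth's real script syntax and rule ordering are not modelled.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of input transformation, keyword match, output transformation.
final class ElizabethSketch {
    static final Map<String, String> INPUT_RULES = new LinkedHashMap<>();
    static final Map<String, String> OUTPUT_RULES = new LinkedHashMap<>();
    static {
        INPUT_RULES.put("mum", "mother");
        OUTPUT_RULES.put("my", "your");
        OUTPUT_RULES.put("me", "you");
    }

    // Keyword pattern: "I think <phrase>", case-insensitive.
    static final Pattern KEYWORD = Pattern.compile("(?i)^i think (.+)$");

    // Replace whole words according to the given rule table.
    static String substituteWords(String text, Map<String, String> rules) {
        StringBuilder sb = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(rules.getOrDefault(word.toLowerCase(Locale.ROOT), word));
        }
        return sb.toString();
    }

    static String respond(String input) {
        String transformed = substituteWords(input, INPUT_RULES);       // step 1
        Matcher m = KEYWORD.matcher(transformed);                       // step 2
        if (!m.matches()) {
            return "TELL ME MORE."; // placeholder no-match message
        }
        String phrase = substituteWords(m.group(1), OUTPUT_RULES);      // step 3
        return "WHY DO YOU THINK " + phrase.toUpperCase(Locale.ROOT) + "?";
    }
}
```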

Machine Learning from the Dialog Diversity Corpus

The DDC is a collection of links to dialogue corpora from different fields.

Examples of these dialogue corpora are:

MICASE Corpus

CIRCLE Corpus

CSPA Corpus

The TRAINS Dialogue Corpus

ICE-Singapore Corpus

Mishler Book Medical Interview

MICASE Corpus

The Michigan Corpus of Academic Spoken English, a collection of transcripts of academic speech events recorded at the University of Michigan.

Astronomy transcript:

S1: circumpolar stars. So if I keep my pointer there, [S2: oh ] <ROTATES CEILING> everything else moves and we all get sick. <SS LAUGH> and we go backwards in time. And that’s even more fun.

S2: make it go really really fast.

Problems:

Long monologues

Overlapping speech

More than two speakers

Extra annotations recording actions, such as <SS LAUGH>

CIRCLE Corpus

Center for Interdisciplinary Research on Constructive Learning Environments

A collection of transcripts of tutorial sessions on topics such as physics, algebra and geometry.

Algebra transcript

TUTOR [ Opening remarks and asks student to read out aloud and begin]

STUD [Reads problem] Mike starts a job at McDonald’s that will pay him 5 dollars and hour, Mike gets dropped off by his parents at the start of is shift. Mike works a “h” hour shift. Write an expression for how much he makes in one night?

[Writes “h*5 = how much he makes”]

Physics transcripts

T: [student name], I’d like you to read the problem carefully, and then tell me your strategy for solving this.

S: ok [Pause 17 sec] hmm.

[Pause 6 sec] T: thinking out loud as much as possible is good

Problem:

Different formats are used to mark speakers and linguistic annotation.

CSPA Corpus

Corpus of Spoken Professional American-English

Includes transcripts of conversations of various types.

LANGER: Hello, I’m delighted to be here. I have carefully read and heard about the University of Albany, the State University of New York. And I’m also the director of the National Research Center on English Learning and Achievement.

STRICKLAND: Her mother wrote the stances. (Laughter)

Problems:

Long-turn monologues.

The transcripts were not “anonymised”.

The TRAINS Dialogue Corpus

A corpus of task-oriented spoken dialogue that has been used in several studies of human-human dialogue.

utt10 : what you'll have to do is you'll have to uh pick out an <sli> uh an engine <sli> and schedule a train to do that
utt11 : u: okay <sli> um <sli> engine <sli> two
utt12 : s: + okay +
utt13 : u: + from + Elmira
utt14 : s: + mm-hm +

Problem

Dealing with extra-linguistic annotation such as ‘+’ and <sli>

ICE-Singapore

International Corpus of English, Singapore English

<$B><ICE-SIN: S1A-099#33:1:B>How how are things otherwise
<ICE-SIN:S1A-099#34:1:B>Are you okay
<$A><ICE-SIN:S1A-099#35:1:A>Uhm okay lah

Problems

Unconstrained conversations

A lot of linguistic annotation

Great variation in turn length

Mishler Book Medical Interviews

A scanned text image, including dialogue between patient and physician.

Problems

Scanned image cannot be converted to text format

Extra linguistic annotation

Desired dialogue corpus characteristics for machine learning

We developed a Java program to read a transcript from the DDC and convert it to AIML format in order to retrain ALICE.

Problems arise when extracting ALICE categories from the DDC:

No standard format to distinguish between speakers.

Extra-linguistic annotations were used.

No standard format for linguistic annotations.

Long turns and monologues.

Irregular turn taking (overlapping).

More than two speakers.

Scanned text images not converted to text format.

To extract AIML, corpus data must be “normalized” to make it look like chatbot transcripts:

1. Two speakers.

2. Structured format.

3. Short, obvious turns without overlapping, and without any unnecessary notes, extra-linguistic expressions etc.
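Part of this normalization is stripping annotation from each turn. The Java sketch below is a minimal illustration that assumes only the three kinds of annotation seen in the corpus samples: angle-bracket tags such as `<sli>`, square-bracket notes such as `[Pause 17 sec]`, and `+` overlap markers. The class name `TurnCleaner` is hypothetical, and speaker labels and turn merging are left out.

```java
// Hypothetical sketch: remove annotation from a single dialogue turn.
final class TurnCleaner {
    static String clean(String turn) {
        return turn
                .replaceAll("<[^>]*>", " ")        // tags such as <sli> or <SS LAUGH>
                .replaceAll("\\[[^\\]]*\\]", " ")  // notes such as [Pause 17 sec]
                .replace("+", " ")                 // overlap markers as in the TRAINS sample
                .replaceAll("\\s+", " ")           // collapse the gaps left behind
                .trim();
    }
}
```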

The Java Program

Converts the dialogue transcript to AIML format.

The output AIML is used to retrain ALICE.

The first speaker's turn becomes the pattern; the second speaker's turn becomes the template.

Example from the MICASE corpus:

S1: circumpolar stars. So if I keep my pointer there, [S2: oh ] <ROTATES CEILING> everything else moves and we all get sick. <SS LAUGH> and we go backwards in time. And that’s even more fun.

S2: make it go really really fast.

The AIML category generated by the program is:

<category>
  <pattern>CIRCUMPOLAR STARS SO IF I KEEP MY POINTER THERE EVERYTHING ELSE MOVES AND WE ALL GET SICK AND WE GO BACKWARDS IN TIME AND THAT’S EVEN MORE FUN</pattern>
  <template>make it go really really fast.</template>
</category>
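The conversion step can be sketched as follows. This is not the authors' program: the class `TranscriptToAiml` and its methods are hypothetical, the input is assumed to be a list of turns that already alternate between two speakers, and pairing every turn with the one that follows it is an assumption about how turns are grouped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: adjacent turns become (pattern, template) categories.
final class TranscriptToAiml {
    // Drop <...> and [...] annotations and tidy the whitespace.
    static String stripAnnotations(String text) {
        return text.replaceAll("<[^>]*>", " ")
                .replaceAll("\\[[^\\]]*\\]", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Patterns are upper-cased and stripped of punctuation other than apostrophes.
    static String toPattern(String text) {
        return stripAnnotations(text)
                .toUpperCase(Locale.ROOT)
                .replaceAll("[^A-Z0-9'\u2019 ]", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Escape characters that are special in XML text.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Each turn is the pattern for the turn that follows it.
    static List<String> convert(List<String> turns) {
        List<String> categories = new ArrayList<>();
        for (int i = 0; i + 1 < turns.size(); i++) {
            categories.add("<category><pattern>" + toPattern(turns.get(i))
                    + "</pattern><template>"
                    + escapeXml(stripAnnotations(turns.get(i + 1)))
                    + "</template></category>");
        }
        return categories;
    }
}
```

Applied to the two MICASE turns above, `convert` yields a single category whose pattern and template match the generated category shown, apart from whitespace.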

Other differences we need to “learn”: using Wmatrix to compare human and chatbot dialogue

Wmatrix is a tool that provides a data-driven method for comparing corpora at three levels: word, part-of-speech (PoS) and semantic tag.

The comparison results are viewed as frequency lists ordered by log-likelihood ratio (LL).

LL values indicate the most important differences between corpora.

Wmatrix was used to compare human-to-human dialogues extracted from the DDC corpora with human-to-computer dialogues extracted from chatting with ALICE.
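The slides do not give the LL formula. The Java sketch below uses the standard two-corpus log-likelihood statistic from corpus linguistics, on the assumption that this is what the reported LL values represent: with observed frequencies a and b in corpora of c and d tokens, the expected frequencies are E1 = c(a+b)/(c+d) and E2 = d(a+b)/(c+d), and LL = 2(a ln(a/E1) + b ln(b/E2)). The class name `LogLikelihood` is hypothetical, and the corpus sizes in the usage note are illustrative guesses, not figures from the source.

```java
// Hypothetical sketch of the two-corpus log-likelihood statistic.
final class LogLikelihood {
    // a, b: observed frequencies in corpus 1 and 2; c, d: corpus sizes in tokens.
    static double ll(long a, long b, long c, long d) {
        double e1 = (double) c * (a + b) / (c + d); // expected frequency in corpus 1
        double e2 = (double) d * (a + b) / (c + d); // expected frequency in corpus 2
        return 2.0 * (term(a, e1) + term(b, e2));
    }

    // A zero observed frequency contributes nothing to the sum.
    private static double term(long observed, double expected) {
        return observed == 0 ? 0.0 : observed * Math.log(observed / expected);
    }

    // '+' if the item is relatively more frequent in corpus 1, '-' otherwise.
    static char sign(long a, long b, long c, long d) {
        return (double) a / c >= (double) b / d ? '+' : '-';
    }
}
```

For instance, `LogLikelihood.ll(44, 35, 1128, 5350)` scores an item seen 44 times in a 1128-token corpus and 35 times in a 5350-token corpus; an item with the same relative frequency in both corpora scores 0.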

ALICE and Astronomy Word Comparison

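Sorted by log-likelihood value. O1 and %1 are the frequency and relative frequency in the ALICE dialogues; O2 and %2 are the same for the Astronomy transcript. A + marks relative overuse in ALICE, a - relative underuse.

Item     O1     %1     O2     %2        LL
do       44   3.90     35   0.65   + 58.69
i        54   4.79     67   1.25   + 48.04
we        1   0.09    129   2.41   - 41.15
so        1   0.09    117   2.19   - 36.75
and       8   0.71    195   3.65   - 35.19
Emily     9   0.80      0   0.00   + 31.46
you      72   6.38    151   2.82   + 28.91
this      0   0.00     70   1.31   - 26.80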

ALICE and Astronomy POS Comparison

Sorted by log-likelihood value

Item     O1     %1     O2     %2         LL
PPIS1    55   4.88      0   0.00   + 192.23
VD0      43   3.81     27   0.50   +  67.27
PPIS2     1   0.09    129   2.41   -  41.15
CC       10   0.89    230   4.30   -  39.86
PPY      80   7.09    155   2.90   +  37.52
CS        4   0.35    116   2.17   -  23.31
ZZ1       0   0.00     56   1.05   -  21.44
DD1       9   0.80    151   2.82   -  19.97

ALICE and Astronomy Semantic Comparison

Sorted by log-likelihood value

Item     O1     %1     O2     %2        LL   Semantic field
Z1       34   3.01     22   0.41   + 52.21   Personal names
E2+      16   1.42      6   0.11   + 32.44   Liking
W1        2   0.18    102   1.91   - 26.27   The universe
M6        7   0.62    151   2.82   - 24.95   Location and direction
M1        0   0.00     59   1.10   - 22.59   Moving, coming
H4        8   0.71      1   0.02   + 22.06   Residence
F1        6   0.53      0   0.00   + 20.97   Food
Q2.1     25   2.22     35   0.65   + 19.27   Speech act

French word comparison between chatbot and real dialogue

Sorted by log-likelihood value

Item            O1     %1     O2     %2        LL
conversation     3   0.01      6   1.01   - 33.18
euh            662   2.80      0   0.00   + 32.91
danser           0   0.00      4   0.67   - 29.66
fais             0   0.00      4   0.67   - 29.66
de             463   1.96     35   5.88   - 29.16
coucher          0   0.00      3   0.50   - 22.24
football         0   0.00      3   0.50   - 22.24

Conclusions

1. We train ALICE rather than Elizabeth because the AIML format is closer to a markup language and because ALICE uses a simple pattern-matching technique.

2. The Dialogue Diversity Corpus (DDC) illustrates huge diversity in dialogues: genres, speaker background/register, mark-up and annotation.

3. It would be useful to agree on standards for transcription and mark-up format.

4. Wmatrix has shown further differences between chatbot and real dialogue.

Future Work

Expanding the AIML files using the least frequent word, and investigating how to incorporate corpus-derived linguistic annotation into an Elizabeth-style chatbot pattern file.