Daniel Shank, Data Scientist, Talla at MLconf SF 2017

Post on 22-Jan-2018



Getting Value Out of Chat Data

What to Do When Your Data Is Noisy, Sparse, and Short

0

Introduction

Contact: daniel@talla.com

1

Talla

NLP for internal business use cases

Smart knowledge management

Hiring!

2

What is “chat data”?

USER2: USER3 do you have new new cal on your Talla account already? Looks like it’s not available for me yet. Would be nice if we could also get inbox support enabled since it’s so much better than gmail. cc USER1
USER3: USER2 I realized that after I typed this that I was using my personal gmail when I updated to the new changes. I looked on Talla and I didn’t see the same option to update to new calendar yet.
USER4: USER2 I just enabled Inbox for our domain
USER4: new calendar is set to letting google decide when to roll it out, but it looks like we can also enable it as an option now
USER4: I've now set that to be available as well. These may take some time to show up
USER1: USER2 its been enabled for awhile.
USER1: (inbox)
USER1: and the new calendar is enabled, soon as google decides you are allowed to have it.
USER2: Thanks USER1 USER4

3

Things similar to chat data

Sequential interactions

Forum posts

Some email

IT ticketing system interactions

Short text

Associated with a user

Possibly directed at another user

Highly context dependent

4

Problems with chat

Increasing number of data sources

In theory contains lots of valuable information

In practice data is unlabeled

“Water, water, everywhere, but not a drop to drink.”

5

Goal: Issue detection and matching

People get help through chat platforms

Extract that data and automate the process

USER1’s interaction should help USER3!

USER1: Hi, does anyone know if we have patriot’s day off?
USER2: Yeah USER1, we do.
USER1: Thanks!
…
USER3: Hey, do we get patriot’s day off?

6

Automating knowledge delivery

Find issues or questions that people have

Match new issues to pre-existing ones

Serve the appropriate response or answer

Extracting answers is very hard

Focus on matching and search
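A minimal sketch of the matching step, assuming plain token overlap (Jaccard similarity) as a stand-in for whatever matcher is actually used. The function names, the threshold value, and the placeholder answer for the second stored issue are illustrative assumptions, not part of the talk.

```python
import re


def token_set(text):
    """Lowercase a message and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(new_question, past_issues, threshold=0.2):
    """Return (question, answer, score) for the closest stored issue.

    past_issues is a list of (question, answer) pairs. Returns None when
    no stored question reaches the (assumed) similarity threshold.
    """
    new_tokens = token_set(new_question)
    best = None
    for question, answer in past_issues:
        score = jaccard(new_tokens, token_set(question))
        if score >= threshold and (best is None or score > best[2]):
            best = (question, answer, score)
    return best


# Stored issues: the first pair comes from the example conversation,
# the second answer is a placeholder.
past = [
    ("Hi, does anyone know if we have patriot's day off?", "Yeah USER1, we do."),
    ("how do i record a vidyo meeting?", "(answer text omitted)"),
]
match = best_match("Hey, do we get patriot's day off?", past)
print(match)
```

Token overlap ignores word order and synonyms, which is one reason better representations matter for this task.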

7

Overview

Jumpstart ML: Active Learning

Topic modeling

Dimensionality Reduction and Representations

8

Find questions and analyze

Use patterns to find questions

Has ‘?’ token

Has a question word

Not too hard

Good start for finding past issues
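The two patterns above can be sketched in a few lines. The question-word list is an assumption (the slides do not enumerate one), and checking only the first token for a question word is a choice made here to limit false positives.

```python
import re

# Assumed question words; not taken from the talk.
QUESTION_WORDS = {
    "who", "what", "when", "where", "why", "how", "which",
    "can", "could", "do", "does", "did", "is", "are", "will", "would", "should",
}


def tokenize(message):
    """Lowercase the message and split it into word tokens, keeping '?' as a token."""
    return re.findall(r"[a-z0-9']+|\?", message.lower())


def looks_like_question(message):
    """True if the message has a '?' token or starts with a question word."""
    tokens = tokenize(message)
    if not tokens:
        return False
    return "?" in tokens or tokens[0] in QUESTION_WORDS


for text in ["Hey, do we get patriot's day off?",
             "how do i record a vidyo meeting",
             "Yeah USER1, we do."]:
    print(looks_like_question(text), text)
```

Messages that open with a greeting or a user mention and lack a '?' would be missed by this version.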

9

Problems with extracted questions

Most questions need context to understand. e.g.:

“What is it?”

“Can I use her personal email?”

Intent varies:

Want information

Do this thing for me

Huh?

10

Only some questions make sense out of context

“Who is she?” “What is that?” “Will that fix my computer?”

Anaphora: it, that

Pronouns: he, she, etc.

“What day is it?”, “Where am I?”

Answer depends on time, person asking

Requires more involved data model
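A rough way to flag questions that probably need context is to look for pronouns and anaphoric words. This is only a sketch: the word list is an assumption, it will misfire on uses such as "that" as a conjunction, and it does not catch questions like "Where am I?" whose answer depends on who is asking.

```python
import re

# Assumed list of words that usually point back to earlier context.
CONTEXT_WORDS = {
    "it", "that", "this", "those", "these",
    "he", "she", "him", "her", "his", "hers", "they", "them", "their",
}


def needs_context(question):
    """True if the question contains a pronoun or anaphoric word."""
    tokens = re.findall(r"[a-z']+", question.lower())
    return any(token in CONTEXT_WORDS for token in tokens)


for q in ["Who is she?", "Will that fix my computer?",
          "how do i record a vidyo meeting?"]:
    print(needs_context(q), q)
```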

11

Questions have different intents

“Performative” – Please help me? ex:

hi can you please help me reset my 2 factor authentication on salesforce?

“Informational” – What is it?

what's the pl code?

“Navigational” – How do I do this?

how do i record a vidyo meeting?

12

Can we write special case rules?

Borderline cases

is there a way to find out the size of an hbase table? – User asks “Is there (a way…)” to get directions

can anyone tell me where i find the out of stock request report? – User asks someone to give them information

Many variants

Alternative is to label data and use supervised learning

13

We want to label data, but…

Managing crowdworkers:

Expensive

Time consuming

Can’t be used unless data is safely anonymous

Will the model work afterwards?

14

Active Learning makes labeling more efficient

More value for your time

Can use with crowd workers or without

Good for chat:

Models train fast

Quick to annotate

Supervised learning with little labeled data

[Diagram: annotation loop with the steps Annotate, Train/Predict, Get data]

15

How it works (roughly)

Annotate an initial subset $D_0 \subset D$

Train your model on $D_0$

Predict labels on the remaining data ($D \setminus D_0$)

Choose more data to annotate, $D_1 \subset D \setminus D_0$

The choice of $D_1$ is based on the label predictions

Repeat

???

Profit!
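The loop above can be sketched as follows. Everything here is an assumption made for illustration: uncertainty sampling is used as the selection rule (the slide only says the choice is based on label predictions), the model is a tiny two-class naive Bayes, and `oracle` stands in for the human annotator.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9']+|\?", text.lower())


class NaiveBayes:
    """Tiny two-class multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def prob_positive(self, text):
        """Estimated probability that the text belongs to class 1."""
        scores = {}
        n_docs = sum(self.docs.values())
        for c in (0, 1):
            total = sum(self.counts[c].values()) + len(self.vocab) + 1
            score = math.log((self.docs[c] + 1) / (n_docs + 2))
            for tok in tokenize(text):
                score += math.log((self.counts[c][tok] + 1) / total)
            scores[c] = score
        top = max(scores.values())
        e0, e1 = math.exp(scores[0] - top), math.exp(scores[1] - top)
        return e1 / (e0 + e1)


def active_learning(pool, oracle, seed_size=2, batch_size=1, rounds=3):
    """Annotate a seed set, then repeatedly label the most uncertain items."""
    labeled = {i: oracle(pool[i]) for i in range(seed_size)}  # annotate D0
    model = NaiveBayes()
    for _ in range(rounds):
        model.fit([pool[i] for i in labeled], list(labeled.values()))
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Predict on the remaining data; closest to 0.5 means most uncertain.
        unlabeled.sort(key=lambda i: abs(model.prob_positive(pool[i]) - 0.5))
        for i in unlabeled[:batch_size]:  # next batch to annotate
            labeled[i] = oracle(pool[i])
    model.fit([pool[i] for i in labeled], list(labeled.values()))
    return model, labeled
```

In practice `oracle` would be an annotation interface, and the pool would be the unlabeled chat messages.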

[Diagram: annotation loop with the steps Annotate, Train/Predict, Get data]

16

Where we are

Jumpstart ML: Active Learning

Topic modeling

Dimensionality Reduction and Representations

17

More to data than questions or intent

What do people talk about?

What kind of issues are common?

Are there clear lines defining topics?

Finding problem areas

Strategic thinking about what to tackle

18

Know Your Data

Read some of it (if you can)

Learn the context

Cluster and overview

19

Clustering or modeling chat topics

LDA, LSA, NMF, others

Human supervision necessary for interpretation (boo!)

Messages short, so chat is hard

Larger documents have broader topic distributions

We expect messages to be about fewer topics

20

Using LDA with Chat

α = 0.5                 | α = 0.1                     | α = 0.05               | α = 0.03
know; does; link        | database; jermaine; running | file; area; bank       | free; jermaine; database
did; try; work          | online; palace; sorry       | mean; try; screen      | user; hi; email
send; test; agent       | try; user; free             | did; ok; want          | client; server; user
look; able; mean        | user; client; error         | error; server; user    | ok; did; update
online; help; screen    | mean; app; does             | whats; agent; end      | mean; user; file
hi; palace; property    | shall; working; process     | client; property; user | online; user; change
email; error; just      | emails; kelly; time         | online; user; update   | mandy; wrong; chance
user; issue; want       | did; ok; property           | palace; live; test     | owner; end; invoice
client; need; check     | ticket; whats; right        | run; right; check      | want; error; agent
owner; report; password | check; chloe; duncan        | emails; know; link     | live; palace; try
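To see why a small α suits short messages, the sketch below samples topic mixtures from a symmetric Dirichlet prior and reports how much weight the largest topic gets on average. It assumes the α on this slide is the document-topic prior of LDA; the number of topics, sample count, and seed are arbitrary choices for illustration, and no LDA model is fitted here.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) over k topics."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(draws)
        if total > 0:  # guard against underflow for very small alpha
            return [d / total for d in draws]


def mean_top_share(alpha, k=10, n=2000, seed=0):
    """Average weight of the single largest topic across n sampled mixtures."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n


for alpha in (0.5, 0.1, 0.05, 0.03):
    print(alpha, round(mean_top_share(alpha), 3))
```

As α shrinks, sampled mixtures concentrate on fewer topics, which matches the expectation that a short message is about one or two topics.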

21

Where we are

Jumpstart ML: Active Learning

Topic modeling

Dimensionality Reduction and Representations

22

Why do dimensionality reduction?

We want to improve our supervised learning techniques

Chat data is even more sparse than many NL datasets

Good representations can help search and similarity models

Off the shelf representations are good

Off the shelf + custom representations are better

23

Setting up methods for learning

Word2vec, NMF, even LDA

Most methods equivalent*

Chat has no clear document barriers

Methods assume either continuous context or separate documents

Using individual messages as contexts is too sparse

24

Choosing a context

Representations are influenced by context choice

Figure out your goal

Choose context where words are associated in a way helpful for your goal

For our purposes: Words should be similar if they occur together in issues people have

25

Using a time-based context window

Window before each question

Problem statement and questions should be related

USER2: Can I email this form, or do I have to print it out?
USER1: You need to drop the form off in person
USER2: OK, sure.
USER1: Great.
USER2: Where can I get access to the printers?
…

26

Keywords are extracted from recent history

USER2: Can I email this form, or do I have to print it out?
USER1: You need to drop the form off in person
USER2: OK, sure.
USER1: Great.
USER2: Where can I get access to the printers?
…
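A sketch of building one context per question from the messages in a time window before it. The timestamps (in seconds), the window length, the stopword list, and the decision to include the question's own keywords in the context are all assumptions made for illustration.

```python
import re

# Assumed stopword list, kept short for the example.
STOPWORDS = {
    "i", "you", "it", "this", "the", "a", "to", "in", "or", "off", "out",
    "can", "do", "have", "need", "get", "where", "ok", "sure", "great",
}


def keywords(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def question_contexts(messages, window_seconds=300):
    """Return one keyword list per question.

    messages is a time-ordered list of (timestamp, user, text) tuples.
    Each context holds the question's keywords plus the keywords of every
    earlier message sent within window_seconds of the question.
    """
    contexts = []
    for i, (ts, _user, text) in enumerate(messages):
        if "?" not in text:
            continue
        words = keywords(text)
        for prev_ts, _prev_user, prev_text in messages[:i]:
            if ts - prev_ts <= window_seconds:
                words.extend(keywords(prev_text))
        contexts.append(words)
    return contexts


# The conversation from the slide, with hypothetical timestamps.
messages = [
    (0, "USER2", "Can I email this form, or do I have to print it out?"),
    (20, "USER1", "You need to drop the form off in person"),
    (35, "USER2", "OK, sure."),
    (40, "USER1", "Great."),
    (90, "USER2", "Where can I get access to the printers?"),
]
print(question_contexts(messages))
```

The resulting keyword lists can then serve as the "documents" or contexts for a representation method such as Word2vec or NMF.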

27

Similarity from resulting representations

‘printer’

['printer', 'choice', 'fuji', 'xerox', 'settings', 'sequence', 'default', 'rollover', 'driver', 'takes', 'smaller', 'main']

‘issue’

['issue', 'resolved', 'helping', 'experiencing', 'companies', 'related', 'assuming', 'reported', 'double', 'site', 'saw', 'causing', 'understand', 'sorted', 'logging', 'heard']

‘ssh’

['ssh', 'config', 'dhcp', 'ping', 'reconnect', 'jpg', 'webconsole', 'coats', 'lab', 'browsers', 'instances', 'bypass']
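Neighbor lists like these come from ranking words by the similarity of their vectors. The sketch below uses raw co-occurrence counts and cosine similarity as a simple stand-in for learned representations such as Word2vec or NMF; the toy contexts are made up for illustration and are not the talk's data.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(contexts):
    """Map each word to counts of the words it shares a context with."""
    vectors = defaultdict(Counter)
    for words in contexts:
        unique = set(words)
        for word in unique:
            for other in unique:
                vectors[word][other] += 1  # includes the word itself
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(word, vectors, n=3):
    """The n words most similar to `word`, excluding the word itself."""
    if word not in vectors:
        return []
    scored = [(cosine(vectors[word], vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:n]]


# Hypothetical contexts, one keyword list per question window.
contexts = [
    ["printer", "driver", "settings"],
    ["printer", "driver", "xerox"],
    ["ssh", "config", "ping"],
    ["ssh", "config", "reconnect"],
]
vectors = cooccurrence_vectors(contexts)
print(most_similar("printer", vectors))
```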

28

Final Thoughts...

Tip of the iceberg

Understand how people interact

What information can we extract?

Can we escape our corpus?

29

Thank you everyone!

thanks

['heaps', 'great', 'perfect', 'fantastic']

30