Using Machine Learning to Monitor Collaborative Interactions
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
VMT-Basilica (Kumar & Rosé, 2010)
Download tools at:
http://www.cs.cmu.edu/~cprose/TagHelper.html
http://www.cs.cmu.edu/~cprose/SIDE.html
Monitoring Collaboration with Machine Learning Technology
[Figure: TagHelper diagram. Labels: Labeled Texts, Unlabeled Texts, Labeled Texts, A Model that can Label More Texts. Plot axes: Time, Behavior. Annotation: <Triggered Intervention>]
TagHelper Tools and SIDE
TagHelper Tools uses text mining technology to automate annotation of conversational data
SIDE facilitates rapid prototyping of reporting interfaces for group learning facilitators
Define Summaries
Annotate Data
Visualize Annotated Data
http://www.cs.cmu.edu/~cprose/TagHelper.html
http://www.cs.cmu.edu/~cprose/SIDE.html
Important caveat!
Machine learning isn’t magic
But it can be useful for identifying meaningful patterns in your data when used properly
Proper use requires insight into your data
Naïve Approach: When all you have is a hammer…
[Diagram labels: Target, Representation, Data]
Problem: there isn’t one universally best approach!
Slightly less naïve approach: Aimless wandering…
[Diagram labels: Target, Representation, Data]
Problem 1: It takes too long!
Problem 2: You might not realize all of the options that are available to you!
Expert Approach: Hypothesis driven
[Diagram labels: Target, Representation, Data]
You might end up with the same solution in the end, but you’ll get there faster.
Today we’ll start to learn how!
What is machine learning?
Automatically or semi-automatically
  Inducing concepts (i.e., rules) from data
  Finding patterns in data
  Explaining data
  Making predictions
[Diagram labels: Data, Learning Algorithm, Model; New Data, Classification Engine, Prediction]
How does machine learning work?
The simplest rule learner will learn to predict whatever is the most frequent result class. This is called the majority class.
What will the rule be in this case?
It will always predict yes.
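The majority-class rule can be sketched in a few lines of Python. The slide's data table is not preserved in this transcript, so the labels below are an assumption: 9 "yes" and 5 "no", as in the classic 14-row weather data set. The function name zero_r is also ours, not something from TagHelper.

```python
from collections import Counter

# Assumed class labels (9 yes, 5 no); the slide's actual table is not in the transcript.
labels = ["yes"] * 9 + ["no"] * 5


def zero_r(labels):
    """Return the most frequent class; 0R predicts it for every instance."""
    return Counter(labels).most_common(1)[0][0]


majority = zero_r(labels)
print("0R always predicts:", majority)
```

Because the rule ignores every feature, its accuracy is simply the share of the majority class, which makes it a useful baseline for anything more elaborate.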
A slightly more sophisticated rule learner will find the feature that gives the most information about the result class. What do you think that would be in this case?
Outlook:
  Sunny -> No
  Overcast -> Yes
  Rainy -> Yes
<Feature Name>:
  <value> -> <prediction>
  <value> -> <prediction>
  …
What will be the prediction?
Outlook:
  Sunny -> No
  Overcast -> Yes
  Rainy -> Yes
Model
New Data
Yes
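A rough sketch of how a single-feature rule learner could arrive at the Outlook rule and then classify new data. The table is an assumption: it is the classic 14-row weather data set, chosen because it is consistent with the rule shown on the slide, whose own table is not in this transcript. The new instance at the end is hypothetical, and the function name one_r is ours.

```python
from collections import Counter, defaultdict

# Assumed data: classic weather data set (outlook, temperature, humidity, windy, play).
FEATURES = ["outlook", "temperature", "humidity", "windy"]
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def one_r(rows, feature_names):
    """Pick the feature whose value -> majority-class rules make the fewest errors."""
    best = None
    for i, name in enumerate(feature_names):
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[i]][row[-1]] += 1  # count classes per feature value
        rules = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        # Strict "<" keeps the earliest feature when error counts tie.
        if best is None or errors < best[2]:
            best = (name, rules, errors)
    return best


feature, rules, errors = one_r(ROWS, FEATURES)
print(feature, rules, "training errors:", errors)

# Hypothetical new instance; 1R only looks at the chosen feature.
new_instance = {"outlook": "overcast", "temperature": "cool", "humidity": "high", "windy": "true"}
prediction = rules[new_instance[feature]]
print("prediction:", prediction)
```

On this assumed table, humidity makes the same number of training errors as outlook, so the tie-breaking choice (first feature wins) decides which rule is reported.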
More Complex Algorithm…
Two simple algorithms last time
  0R – Predict the majority class
  1R – Use the most predictive single feature
Today – Intro to Decision Trees
  Today we will stay at a high level
  We’ll investigate more details of the algorithm next time
* Only makes 2 mistakes!
What will it do with this example?
Why is it better?
Not because it is more complex
  Sometimes more complexity makes performance worse
What is different in what the three rule representations assume about your data?
  0R
  1R
  Trees
The best algorithm for your data will give you exactly the power you need
Let’s say you know the rule you are trying to learn is a circle and you have these points. What rule would you learn?
Now let’s say you don’t know the shape. Now what would you learn?
If you know the shape, you have fewer degrees of freedom – less room to make a mistake.
What do concepts look like?
Clarification: Concepts as Lines
[Figure: labeled points R, B, S, T, C and several X marks]
Machine Learning Process Overview
Get to know your data
  What distinguishes messages from different categories
Represent messages in terms of features
  Use feature table tab
Build machine learning model
  Use machine learning tab
Learn from mistakes, and try again
  Use feature analyzer tab
Features Coding
Machine Learning
Algorithms you will use
Decision Trees (J48): good with small feature sets, can find contingencies between features
Naïve Bayes: fast, makes decisions based on probabilities
Support Vector Machines (SMO): make decisions based on weights, usually work well on text
Setting Up Your Data
How do you know when you have coded enough data?
What distinguishes Questions and Statements?
Not all questions end in a question mark.
Not all WH words occur in questions
I versus you is not a reliable predictor
You need to code enough to avoid learning rules that won’t work
Basic Idea
Represent text as a vector where each position corresponds to a term
This is called the “bag of words” approach
                   Cheese  Cows  Eggs  Hens  Lay  Make
Cows make cheese     1      1     0     0     0    1
Hens lay eggs        0      0     1     1     1    0
But same representation for “Cheese makes cows.”!
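The vectors above can be reproduced with a short Python sketch. One assumption is needed to match the slide: "makes" and "make" (and "cows" and "cow") must count as the same term, so the sketch strips a trailing "s" as a crude stand-in for stemming. The function names are ours.

```python
def normalize(word):
    """Lowercase, drop surrounding punctuation, strip a trailing 's' (crude, assumed stemming)."""
    word = word.lower().strip(".,!?\"'")
    return word[:-1] if word.endswith("s") else word


def bag_of_words(text, vocabulary):
    """Return a 0/1 vector with one position per vocabulary term; word order is discarded."""
    present = {normalize(w) for w in text.split()}
    return [1 if term in present else 0 for term in vocabulary]


texts = ["Cows make cheese", "Hens lay eggs", "Cheese makes cows."]
vocabulary = sorted({normalize(w) for t in texts for w in t.split()})
print(vocabulary)

for t in texts:
    print(t, "->", "".join(str(bit) for bit in bag_of_words(t, vocabulary)))
```

The first and third sentences map to the same vector, which is exactly the loss of word order the slide points out.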
What can’t you conclude from “bag of words” representations?
Causality: “X caused Y” versus “Y caused X”
Roles and Mood: “Which person ate the food that I prepared this morning and drives the big car in front of my cat” versus “The person, which prepared food that my cat and I ate this morning, drives in front of the big car.”
Who’s driving, who’s eating, and who’s preparing food?
Part of Speech Tagging
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition/subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund/present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd ps. sing. present
32. VBZ Verb, 3rd ps. sing. present
33. WDT wh-determiner
34. WP wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB wh-adverb
http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html
Feature Space Design
Think like a computer!
  Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful
Look for approximations
  If you want to find questions, you don’t need to do a complete syntactic analysis
  Look for question marks
  Look for wh-terms that occur immediately before an auxiliary verb
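The two approximations for spotting questions might look like this in Python. The word lists are assumptions and deliberately incomplete, and the feature names are ours; this is a sketch of the idea, not how TagHelper computes its features.

```python
import re

# Assumed, incomplete word lists for illustration only.
WH_TERMS = {"who", "what", "when", "where", "why", "which", "how"}
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did",
               "can", "could", "will", "would", "should", "have", "has"}


def question_features(message):
    """Cheap surface cues for questions; no syntactic analysis involved."""
    tokens = re.findall(r"[a-z']+", message.lower())
    wh_before_aux = any(
        first in WH_TERMS and second in AUXILIARIES
        for first, second in zip(tokens, tokens[1:])
    )
    return {"has_question_mark": "?" in message, "wh_before_aux": wh_before_aux}


for msg in ["what is the common denominator",
            "you think the answer is 9?",
            "I know what you mean."]:
    print(msg, "->", question_features(msg))
```

Neither cue is reliable alone: the first message has no question mark, the second has no wh-term, and the third contains a wh-term but is a statement.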
Feature Space Design
Punctuation can be a “stand in” for mood
  “you think the answer is 9?”
  “you think the answer is 9.”
Bigrams capture simple lexical patterns
  “common denominator” versus “common multiple”
POS bigrams capture syntactic or stylistic information
  “the answer which is …” vs “which is the answer”
Line length can be a proxy for explanation depth
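A small Python sketch of three of these surface features: final punctuation, word bigrams, and line length. POS bigrams are left out because they require a part-of-speech tagger. The feature names and the underscore-joined bigram format are our own choices.

```python
def surface_features(message):
    """Extract punctuation, word-bigram, and length features from one message."""
    text = message.strip()
    tokens = text.lower().rstrip("?.!").split()
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return {
        "ends_with_question_mark": text.endswith("?"),
        "bigrams": bigrams,
        "length_in_words": len(tokens),
    }


question = surface_features("you think the answer is 9?")
statement = surface_features("you think the answer is 9.")
print(question)
print(statement)
print(surface_features("common denominator")["bigrams"])
print(surface_features("common multiple")["bigrams"])
```

The two "answer is 9" messages share every bigram and differ only in the punctuation feature, while the bigram feature is what separates "common denominator" from "common multiple".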
Feature Space Design
“Contains non-stop word” can be a predictor of whether a conversational contribution is contentful
  “ok sure” versus “the common denominator”
Removing stop words removes some distracting features
Stemming allows some generalization
  Multiple, multiply, multiplication
Removing rare features is a cheap form of feature selection
  Features that only occur once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space
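The stop-word and rare-feature ideas can be sketched in Python as follows. The stop-word list and the five-message corpus are hypothetical, and the threshold of three occurrences is one reading of "once or twice". Stemming is omitted because it needs a real stemmer.

```python
from collections import Counter

# Assumed, tiny stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"ok", "sure", "the", "a", "is", "yes", "i", "so"}


def contains_non_stop_word(message):
    """True if at least one token is not a stop word (a rough 'contentful' cue)."""
    return any(tok not in STOP_WORDS for tok in message.lower().split())


def select_features(messages, min_count=3):
    """Keep non-stop terms that occur at least min_count times in the corpus."""
    counts = Counter(tok for m in messages for tok in m.lower().split())
    return sorted(t for t, n in counts.items() if n >= min_count and t not in STOP_WORDS)


# Hypothetical mini-corpus.
corpus = [
    "ok sure",
    "the common denominator",
    "common denominator is 12",
    "so the common multiple",
    "find the common denominator",
]
print([contains_non_stop_word(m) for m in corpus])
print(select_features(corpus))
```

In this toy corpus only "common" and "denominator" survive: "the" is frequent but a stop word, and every other term occurs fewer than three times.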
Error Analysis
Any Questions?