Better Machine Learning Through Data -...
Transcript of Better Machine Learning Through Data -...
![Page 1: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/1.jpg)
Better Machine Learning Through Data
Saleema AmershiMachine Teaching GroupMicrosoft Research
August 14, 2016
![Page 2: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/2.jpg)
Making better sense of data.
Better data makes better machine learning.
![Page 3: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/3.jpg)
Data + Algorithm = Model
Data ModelAlgorithm
![Page 4: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/4.jpg)
Data + Algorithm = Model
Data ModelAlgorithm
Machine learning research often takes the data as given.
![Page 5: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/5.jpg)
When Algorithms Discriminate – The New York Times, 2015
Big Data’s all-too-human failings – Reuters, 2016
Artificial Intelligence’s White Guy Problem – The New York Times, 2016
Mapping Crime – Or Stirring Hate?– Financial Times, 2014
![Page 6: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/6.jpg)
Making better sense of data.
Better data makes better machine learning.
Most influence practitioners have on machine learning is through data.
![Page 7: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/7.jpg)
Data + Algorithm = Model
Data ModelAlgorithm
In research,
data is often taken as given.
![Page 8: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/8.jpg)
Algorithm
Data + Algorithm = Model
ModelData
In practice, the
algorithm is often taken as given.
In research,
data is often taken as given.
![Page 9: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/9.jpg)
Algorithm
Data + Algorithm = Model
ModelData
“Data scientists, according to interviews and
expert estimates, spend 50 percent to 80 percent
of their time mired in this more mundane labor
of collecting and preparing unruly digital data.”
- New York Times, 2014
In practice, the
algorithm is often taken as given.
![Page 10: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/10.jpg)
Algorithm
Data + Algorithm = Model
ModelData
![Page 11: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/11.jpg)
[Patel et al., CHI 2008]
![Page 12: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/12.jpg)
Algorithm
Data + Algorithm = Model
ModelData
![Page 13: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/13.jpg)
Algorithm
Data + Algorithm = Model
Data Model
Iterations are driven by evaluating models on data.
![Page 14: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/14.jpg)
Algorithm
Data + Algorithm = Model
ModelData
Iterations are driven by evaluating models on data.
In practice, most effort is spent crafting input data.
![Page 15: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/15.jpg)
Algorithm ModelData
Machine learning in theory
![Page 16: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/16.jpg)
Algorithm
Collect &
Label
Samples
Create
Features
Evaluate
Results
Machine learning in practice
![Page 17: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/17.jpg)
AlgorithmEvaluate
Results
Collect &
Label
Samples
Create
Features
![Page 18: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/18.jpg)
AlgorithmEvaluate
Results
Collect &
Label
Samples
Create
Features
Structured Labeling [CHI 2014]
Feature Insight [VAST 2015]
ModelTracker[CHI 2015, VAST 2016]
![Page 19: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/19.jpg)
Evaluate
Results
Create
FeaturesAlgorithm
Feature Insight [VAST 2015]
ModelTracker[CHI 2015, VAST 2016]
Collect &
Label
Samples
Structured Labeling [CHI 2014]
![Page 20: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/20.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
0
0
![Page 21: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/21.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
0
0
![Page 22: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/22.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
1
0
![Page 23: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/23.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
1
0
![Page 24: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/24.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
2
0
![Page 25: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/25.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
2
0
![Page 26: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/26.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
2
1
![Page 27: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/27.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
2
1
![Page 28: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/28.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
2
2
![Page 29: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/29.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
2
2
![Page 30: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/30.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
3
2
![Page 31: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/31.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
3
2
![Page 32: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/32.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
4
2
![Page 33: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/33.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
4
2
![Page 34: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/34.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
4
3
![Page 35: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/35.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
4
3
![Page 36: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/36.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
4
3
Does not support
concept evolution
(refining the target
concept as data is
observed).
![Page 37: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/37.jpg)
How common is concept evolution?
Nine machine learning experts labeled the same 200 pages in two sessions (4 weeks apart).
Average consistency 81.7% (SD=6.8%)
6 out of 9 people’s labels changed significantly (via Chi Square test of symmetry)
25
50
75
100
Co
nsi
sten
cy
Participants
![Page 38: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/38.jpg)
Proposed Solution – Structured Labeling
Enable people to explicitly organize their concept via grouping and tagging within a traditional labeling scheme.
![Page 39: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/39.jpg)
Traditional Labeling
Pre-defined high-level categories.
Cat
Not Cat
Is this a Cat?
2
2
![Page 40: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/40.jpg)
Structured Labeling
Cat
Not Cat
Is this a Cat?
2
Definitely Cat
2
Definitely Not Cat
![Page 41: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/41.jpg)
Structured Labeling
Cat
Not Cat
Is this a Cat?
2
Definitely Cat
2
Definitely Not Cat
![Page 42: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/42.jpg)
Structured Labeling
Cat
Not Cat
Is this a Cat?
2
Definitely Cat
2
Definitely Not Cat
2
2
Definitely Not Cat
Cat Poster
1
User provided tags on groups aid recall.
Grouping within high-level categories.
![Page 43: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/43.jpg)
Structured Labeling
Cat
Not Cat
Is this a Cat?
2
Definitely Cat
2
Definitely Not Cat
2
2
Definitely Not Cat
Cat Poster
1
User provided tags on groups aid recall.
Grouping within high-level categories.Blogs
2
Lions
2
![Page 44: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/44.jpg)
Structured Labeling
Cat
Not Cat
Is this a Cat?
2
Definitely Cat
2
Definitely Not Cat
2
2
Definitely Not Cat
Cat Poster
1
User provided tags on groups aid recall.
Grouping within high-level categories.Blogs
2
Lions
2
![Page 45: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/45.jpg)
Structured Labeling
Cat
Not Cat
Is this a Cat?
2
Definitely Cat
2
Definitely Not Cat
2
2
Definitely Not Cat
User provided tags on groups aid recall.
Grouping within high-level categories.Blogs
2
Lions
2
Cat Poster
2
![Page 46: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/46.jpg)
Structured Labeling
Cat
Not Cat
Is this a Cat?
2
Definitely Cat
2
Definitely Not Cat
2
2
Definitely Not Cat
User provided tags on groups aid recall.
Grouping within high-level categories.
Lions
2
Blogs
2
Cat Poster
2
Can move, merge
and split groups
as desired.
![Page 47: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/47.jpg)
Assisted Structured Labeling
Cat
Not Cat
Is this a Cat?
2
Definitely Cat
2
Definitely Not Cat
2
2
Definitely Not Cat
Lions
2
Blogs
2
Cat Poster
2
Grouping recommendations to improve label consistency.
![Page 48: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/48.jpg)
Assisted Structured Labeling
Cat
Not Cat
Is this a Cat?
2
Definitely Cat
2
Definitely Not Cat
2
2
Definitely Not Cat
Lions
2
Blogs
2
Cat Poster
2
Grouping recommendations to improve label consistency.
Similar items to help users make decisions.
![Page 49: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/49.jpg)
Findings
People revised labels significantly more with structured labeling
People labeled more consistently
People preferred it over traditional labeling
Label Consistency
(X2=6.53, df=2, p < .038)(X2=20.19, df=2, p < .001)
Mean # Groups
(X2=12, df=2, p < .002)
# Revisions
![Page 50: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/50.jpg)
Structured Labeling Summary
Current tools do not support concept evolution.
Structured labeling helps people refine their concepts by surfacing labeling decisions and aiding recall.
People used structured labeling when it was available and labeled more consistently.
Structure contains additional information (e.g., group related features, group related accuracy, decisions made…)
![Page 51: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/51.jpg)
Evaluate
ResultsAlgorithm
ModelTracker[CHI 2015, VAST 2016]
Collect &
Label
Samples
Create
Features
Feature Insight [VAST 2015]
Structured labeling improves consistency
[CHI 2014]
![Page 52: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/52.jpg)
“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.” [Domingos, CACM 2012]
…yet, little guidance or best practices exist.
![Page 53: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/53.jpg)
How do people come up with features?
Look for features used in related domains.
Use intuition or domain knowledge.
Apply automated techniques
Feature ideation – Think of and experiment with custom features (a “black art”).
![Page 54: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/54.jpg)
Proposed Solution – Feature Insight
Support compare and contrast of data.
![Page 55: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/55.jpg)
What makes a cat a cat?
![Page 56: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/56.jpg)
What makes a cat a cat?
![Page 57: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/57.jpg)
Proposed Solution – Feature Insight
Support compare and contrast of data.
Comparing pairs vs sets?
![Page 58: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/58.jpg)
Comparing Pairs vs Sets
Sets may help people think of generalizable features.
NegativePositivePositives Negatives
vs
![Page 59: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/59.jpg)
Proposed Solution – Feature Insight
Support compare and contrast of data.
Comparing pairs vs sets?
Raw data vs visual summaries?
![Page 60: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/60.jpg)
Looking at Raw Data vs. Visual Summaries
Visual summaries may reveal relevant characteristics and hide irrelevant noise.
vs
Raw Data Visual Summary
![Page 61: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/61.jpg)
Individual Comparison Set Comparison
Raw
Dat
aV
isu
alSu
mm
arie
s
![Page 62: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/62.jpg)
0.0
0.5
1.0
ClassifierPerformance (p<.01)
0
2
4
6
Feature Count
0
1
2
3
Preference Rank(Small is better) (p=.03)
Raw + Individual
Raw + Set
Visual + Individual
Visual + Set
Findings
Visual summaries led to better features
Visual summaries preferred over looking at raw data
Sets useful only in combination with visuals
![Page 63: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/63.jpg)
Feature Insight Summary
Featuring is arguably the most important step in machine learning, but there is little guidance on feature ideation.
Feature Insight supports error comparison, examination of sets, and visual summaries.
Visual summaries help people create better quality features.
![Page 64: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/64.jpg)
Algorithm
Collect &
Label
Samples
Structured Labeling [CHI 2014]
Create
Features
Feature Insight [VAST 2015]
Evaluate
Results
ModelTracker[CHI 2015, VAST 2016]
![Page 65: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/65.jpg)
AlgorithmEvaluate
Results
ModelTracker[CHI 2015, VAST 2016]
Collect &
Label
Samples
Structured Labeling [CHI 2014]
Create
Features
Feature Insight [VAST 2015]
How do people evaluate performance?
![Page 66: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/66.jpg)
AlgorithmEvaluate
Results
Collect &
Label
Samples
Create
Features
0.710.67 0.70 ???
Predicted
Act
ual
Positive Negative
Positive 143 72
Negative 35 190
How do people evaluate performance?
Summary statistics hide important
information about model behavior.
![Page 67: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/67.jpg)
AlgorithmEvaluate
Results
Collect &
Label
Samples
Create
Features
0.710.67 0.70 ???
Predicted
Act
ual
Positive Negative
Positive 143 72
Negative 35 190
How do people evaluate performance?
Summary statistics hide important
information about model behavior.
![Page 68: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/68.jpg)
AlgorithmEvaluate
Results
Collect &
Label
Samples
Create
Features
0.710.67 0.70 ???
Predicted
Act
ual
Positive Negative
Positive 143 72
Negative 35 190
Summary statistics hide important
information about model behavior.
Switching tools to examine data is
disruptive and leads to a trial-and-
error approach [Patel et al., AAAI 2008].
How do people evaluate performance?
![Page 69: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/69.jpg)
Example: Predicting Income Levels
![Page 70: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/70.jpg)
![Page 71: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/71.jpg)
Decision Tree
86% Accuracy
Support Vector Machine
85% Accuracy
![Page 72: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/72.jpg)
Decision Tree
86% Accuracy
Support Vector Machine
85% Accuracy
![Page 73: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/73.jpg)
ModelTracker Demo
![Page 74: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/74.jpg)
Significantly faster and more accurate performance analysis
ModelTracker Common Confusion Matrix
![Page 75: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/75.jpg)
ModelTracker Summary
Current tools for performance analysis and debugging hide a lot of important information about model behavior.
ModelTracker supports estimating performance at multiple levels of granularity while enabling direct access to data.
People are significantly faster and more accurate at performance analysis with ModelTracker.
![Page 76: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/76.jpg)
AlgorithmEvaluate
Results
ModelTracker[CHI 2015, VAST 2016]
Collect &
Label
Samples
Structured Labeling [CHI 2014]
Create
Features
Feature Insight [VAST 2015]
![Page 77: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/77.jpg)
TuneCollect Clean DeployTrain
Many more opportunities to better support machine learning in practice.
Label Feature Evaluate
![Page 78: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/78.jpg)
Tune EvaluateLabel FeatureCollect Clean Train Deploy
Many more opportunities to better support machine learning in practice.
![Page 79: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/79.jpg)
Tune EvaluateFeatureCollect Clean DeployLabel Train
Many more opportunities to better support machine learning in practice and theory.
![Page 80: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/80.jpg)
Making better sense of data.
Better data means better machine learning.
Most influence practitioners have on machine learning is through data.
Many more opportunities!
![Page 81: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine](https://reader034.fdocuments.in/reader034/viewer/2022043008/5f98ca7c6e4286138a5f0836/html5/thumbnails/81.jpg)
Better Machine Learning Through Data
Saleema Amershi, [email protected] Teaching GroupMicrosoft Research
August 14, 2016
Thanks! Questions?