Intelligently Creating and Recommending Reusable Reformatting Rules Christopher Scaffidi Brad Myers, Mary Shaw Carnegie Mellon University


Intelligently Creating and Recommending

Reusable Reformatting Rules

Christopher Scaffidi, Brad Myers, Mary Shaw

Carnegie Mellon University


People often use spreadsheets to store and organize “string” data

According to a study by the Univ. of Nebraska, nearly 40% of spreadsheet cells are strings (i.e., not numbers, formulas, or dates)

Example task found while observing administrative assistants (contextual inquiry)…

• Build a roster of employee contact info
  – Visit several project teams’ web sites
  – Copy data from web sites into spreadsheet
  – Manually put data into consistent format (because users care about formatting when creating reports)

Introduction Editor Recommendation Evaluation


Mishmash of formats and invalid strings


- illustrative example (not actually the spreadsheet in the contextual inquiry)

- part of an actual spreadsheet from CMU web site


Needed: automated support for validating and reformatting domain-specific strings

• Finding and fixing strings is tedious and error-prone

• Excel and other tools provide no features for automatically reformatting domain-specific strings
  – Only for numeric data & a few specific kinds of strings (not domain-extensible)


Underlying problem: abstraction mismatch

• Tools support strings, ints, floats, sometimes dates.

• Problem domain involves higher-level, multi-format categories of strings:
  – Person names
  – CMU department names
  – CMU course numbers
  – CMU building room numbers


Tope: Each tope describes how to validate and reformat one kind of string

A notional depiction of a tope for CMU room numbers…

Node = format, edge = reformatting rule

• Formal building name & room number: Elliot Dunlap Smith Hall 225
• Colloquial building name & room number: Smith 225
• Building abbreviation & room number: EDSH 225
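The notional depiction above can be read as a small graph. The following is a minimal Python sketch under stated assumptions: the `Tope` class, its method names, the regular expressions, and the one-building lookup table are all hypothetical and chosen only to illustrate "node = format, edge = reformatting rule"; they are not the authors' implementation.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Tope:
    """Hypothetical model: nodes are formats, edges are reformatting rules."""
    formats: Dict[str, Callable[[str], bool]] = field(default_factory=dict)
    rules: Dict[Tuple[str, str], Callable[[str], str]] = field(default_factory=dict)

    def format_of(self, s: str) -> Optional[str]:
        """Return the name of the first format that accepts s, or None."""
        for name, accepts in self.formats.items():
            if accepts(s):
                return name
        return None

    def reformat(self, s: str, target: str) -> str:
        """Follow the edge from the format that s is in to the target format."""
        source = self.format_of(s)
        if source is None:
            raise ValueError(f"{s!r} matches no format of this tope")
        if source == target:
            return s
        return self.rules[(source, target)](s)


# Illustrative data for a single building, taken from the slide's example.
BUILDING = {
    "formal": "Elliot Dunlap Smith Hall",
    "colloquial": "Smith",
    "abbrev": "EDSH",
}


def _pattern(kind: str) -> "re.Pattern[str]":
    # A building name in the given style, one space, then the room number.
    return re.compile(re.escape(BUILDING[kind]) + r" (\d+)")


def _matcher(kind: str) -> Callable[[str], bool]:
    pattern = _pattern(kind)
    return lambda s: pattern.fullmatch(s) is not None


def _rule(src: str, dst: str) -> Callable[[str], str]:
    pattern = _pattern(src)
    return lambda s: f"{BUILDING[dst]} {pattern.fullmatch(s).group(1)}"


room = Tope()
for kind in BUILDING:
    room.formats[kind] = _matcher(kind)
for src in BUILDING:
    for dst in BUILDING:
        if src != dst:
            room.rules[(src, dst)] = _rule(src, dst)

print(room.reformat("Smith 225", "abbrev"))  # EDSH 225
```

A string that matches none of the three formats is reported as invalid instead of being reformatted, which mirrors the slide's pairing of validation with reformatting.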


What’s new and interesting today? Auto-reformatting and recommendation

• Previous work:
  – Early tope editing tool for creating topes to validate and reformat spreadsheet, web form and web macro data [ICSE’08, FSE’08]
  – Inferring new topes from example strings [ICEIS’07]
  – Usability evaluation of the early tope editing tool [ISEUD’09]

• Limitations of previous work:
  – Tedious to implement reformatting rules
  – Tedious to reuse topes

• Contributions today:
  – Automatic reformatting
  – Tope recommendation


New “Format As” feature


Today’s presentation

• Introduction
  – Problem overview
  – Topes overview

• Tope editing tool: Toped++

• Tope recommendation

• Evaluation
  – Evaluation of usability
  – Evaluation of accuracy & speed

• Conclusion


Creating a new tope

Highlight cells containing example strings… system infers a boilerplate tope


Data Description Editor

Toped++: an improved editor for topes


Whitelist tab

Other kinds of data easily described with a whitelist:
• US state names & abbreviations
• Campus building names & abbreviations


Auto-reformatting

• Topes with a single word-like part
  – 4 formats: UPPER CASE, lower case, Title Case, miXeD cAse

• Topes with a single numeric part
  – One format per # digits allowed: pad with “0” and/or round

• Topes with multiple parts and separators
  – (Recursively) reformat each part, concatenate with separators

• Topes that also have a whitelist
  – One format per synonym column: use lookup table

• Important: after reformatting, test the resulting string against the target format’s grammar to detect errors.
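The four cases above, plus the final check, can be sketched in Python. This is an illustration under assumptions, not Toped++ itself: every function name is hypothetical, the target grammar is stood in for by a regular expression, rounding of numeric parts is omitted, and "miXeD cAse" is treated as leaving the word unchanged.

```python
import re
from typing import Callable, Sequence


def reformat_word(word: str, target: str) -> str:
    """Single word-like part: only the letter case changes."""
    if target == "upper":
        return word.upper()
    if target == "lower":
        return word.lower()
    if target == "title":
        return word.capitalize()
    if target == "mixed":
        return word  # assumption: mixed case accepts the word as written
    raise ValueError(f"unknown word format: {target}")


def reformat_number(text: str, digits: int) -> str:
    """Single numeric part: pad with "0" to the width the format allows."""
    return str(int(text)).zfill(digits)


def reformat_by_whitelist(value: str, rows: Sequence[Sequence[str]], target_column: int) -> str:
    """Whitelist: find the row holding value, return its synonym in target_column."""
    for row in rows:
        if value in row:
            return row[target_column]
    raise ValueError(f"{value!r} is not in the whitelist")


def reformat_multipart(text: str, split_pattern: str,
                       part_rules: Sequence[Callable[[str], str]], separator: str) -> str:
    """Multiple parts: reformat each part, then concatenate with the target separator."""
    parts = re.split(split_pattern, text)
    if len(parts) != len(part_rules):
        raise ValueError(f"expected {len(part_rules)} parts, got {len(parts)}")
    return separator.join(rule(part) for rule, part in zip(part_rules, parts))


def checked(result: str, target_grammar: str) -> str:
    """Test the reformatted string against the target format to detect errors."""
    if re.fullmatch(target_grammar, result) is None:
        raise ValueError(f"{result!r} does not match the target format")
    return result


phone = checked(
    reformat_multipart("808-202-3030", r"[-.]", [str, str, str], "."),
    r"\d{3}\.\d{3}\.\d{4}",
)
print(phone)  # 808.202.3030
```

Running the check last means that a rule applied to a malformed input (for example, a phone number with a missing digit) is caught before the result is written back to the spreadsheet.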


Supporting reuse: Recommendation via search-by-match algorithm

Algorithm summary:

1. Sort topes by # keywords hit
2. Break ties by testing examples against whitelists
3. Break remaining ties by testing examples against the rest of the tope
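The three-step summary above amounts to a lexicographic sort. A minimal Python sketch, assuming a hypothetical `TopeEntry` record in which a `matches` predicate stands in for "the rest of the tope"; unlike a real implementation, it scores every tope eagerly instead of running the later tests only when ties remain.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set


@dataclass
class TopeEntry:
    """Hypothetical record for one tope stored on the user's computer."""
    name: str
    keywords: Set[str]
    whitelist: Set[str]
    matches: Callable[[str], bool]  # stands in for the rest of the tope


def recommend(topes: Sequence[TopeEntry], keywords: Sequence[str],
              examples: Sequence[str], limit: int = 5) -> List[str]:
    """Rank topes by keyword hits, then whitelist hits, then format matches."""
    query = {k.lower() for k in keywords}

    def score(tope: TopeEntry):
        keyword_hits = len(query & tope.keywords)
        whitelist_hits = sum(e in tope.whitelist for e in examples)
        grammar_hits = sum(tope.matches(e) for e in examples)
        # Tuples compare left to right, which implements the tie-breaking order.
        return (keyword_hits, whitelist_hits, grammar_hits)

    ranked = sorted(topes, key=score, reverse=True)
    return [t.name for t in ranked[:limit]]


# Illustrative topes; names, keywords and whitelists are made up for the example.
topes = [
    TopeEntry("airport code", {"airport", "code"}, {"PIT", "JFK"},
              lambda s: len(s) == 3 and s.isupper()),
    TopeEntry("state abbreviation", {"state", "code"}, {"PA", "OH", "NY"},
              lambda s: len(s) == 2 and s.isupper()),
]

# Both topes hit the keyword "code"; the whitelist test breaks the tie.
print(recommend(topes, ["code"], ["PA", "NY"]))
```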


Implementation details: Speeding up the recommendation

• Counting keyword hits and whitelist hits is easy
  – just use an inverted index.

• But testing every example on every tope is wasteful

• Why test a tope if it couldn’t match anyway?

• For example, if a phone number tope can only match formats like “808-202-3030” and “808.202.3030”, then it only needs to be tested against examples that have 10 digits and 2 hyphens or periods.
  – Index topes according to their “character content”
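The slide does not say how "character content" is represented, so the following Python sketch makes an assumption: a string's content is its digit count, its letter count, and its sorted punctuation, and each tope registers one sample string per format. That only suits fixed-length formats; the `ContentIndex` class and `char_content` function are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Content = Tuple[int, int, str]


def char_content(s: str) -> Content:
    """Summarize a string as (digit count, letter count, sorted other characters)."""
    digits = sum(c.isdigit() for c in s)
    letters = sum(c.isalpha() for c in s)
    others = "".join(sorted(c for c in s if not c.isalnum()))
    return (digits, letters, others)


class ContentIndex:
    """Maps a character-content signature to the topes that could match it."""

    def __init__(self) -> None:
        self._by_content: Dict[Content, Set[str]] = defaultdict(set)

    def add(self, tope_name: str, sample_strings: Iterable[str]) -> None:
        # One representative string per format of the tope.
        for sample in sample_strings:
            self._by_content[char_content(sample)].add(tope_name)

    def candidates(self, example: str) -> Set[str]:
        """Topes worth testing against this example; all others are skipped."""
        return set(self._by_content.get(char_content(example), ()))


index = ContentIndex()
index.add("phone number", ["808-202-3030", "808.202.3030"])
index.add("zip code", ["15213"])

print(index.candidates("412-268-1000"))  # {'phone number'}
print(index.candidates("hello"))         # set()
```

Only the topes returned by `candidates` need the expensive full test, so an example such as "hello" costs a single dictionary lookup.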


Evaluating usability for fixing spreadsheet data

• 9 master’s students, primarily in business
• Baseline: fixing strings manually

• Within-subject study design with 4 phases:
  – Tutorial task (up to 30 minutes)
  – Three tasks using Toped++ (up to 30 minutes total)
    • using Toped++ to fix typos and reformat 100 cells, each
  – Same three tasks manually (up to 1 minute each)
  – Satisfaction questionnaire


Task details

• Each task = Find and fix typos in 100 spreadsheet cells, then put the cells into a specified format
  – E.g.: add “.com” to email addresses lacking top-level domain, then reformat like “[email protected]”

• Different kinds of data assigned to different users:
  – 3 users: Person first name, last name, university (single-part word-like topes)
  – 3 users: Course number, state name, country name (whitelist-driven topes; we provided whitelists from web)
  – 3 users: Email address, phone number, person name (multi-part topes)


Usability: Improves user speed with negligible errors

Data | Minutes required, Toped++ (actual) | Minutes required, manual (projected) | Breakeven point (# cells)
Group 1: Single word data | 3.0 | 5.0 | 60
Group 2: Whitelist data | 6.9 | 15.9 | 43
Group 3: Multi-part data | 3.6 | 10.2 | 35
Overall average | 4.5 | 9.6 | 47

• With ~1/1000 error rate
• Manual times are projected, based on how many seconds participants spent fixing typos & reformatting each cell
• Even without reuse!
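The slide does not define the breakeven point. One reading that reproduces all four table values is: the number of cells at which the projected manual time equals the Toped++ time actually spent on a 100-cell task. The short Python check below works through that arithmetic; the function name and the interpretation are assumptions.

```python
def breakeven_cells(toped_minutes: float, manual_minutes: float, cells: int = 100) -> int:
    """Cells at which manual effort equals the Toped++ time actually spent.

    Assumes manual time grows linearly with the number of cells, so one cell
    costs manual_minutes / cells minutes.
    """
    return round(toped_minutes * cells / manual_minutes)


rows = {
    "Group 1: Single word data": (3.0, 5.0),
    "Group 2: Whitelist data": (6.9, 15.9),
    "Group 3: Multi-part data": (3.6, 10.2),
    "Overall average": (4.5, 9.6),
}

for label, (toped, manual) in rows.items():
    print(f"{label}: {breakeven_cells(toped, manual)} cells")
```

For Group 1, for instance, manual work costs 5.0 / 100 = 0.05 minutes per cell, so the 3.0 minutes spent with Toped++ are recovered after 3.0 / 0.05 = 60 cells.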


User satisfaction: They want to use topes

• User preference: Toped++ or doing tasks manually
  – Every user strongly preferred Toped++

• 5-point Likert scales asking…
  – How easy Toped++ was to use
  – How much users trusted it
  – How pleasant it was to use
  – If they would use it if made available
  – Every participant but one gave a score of 4 or 5 on every question (the good end of the scale)

• Two users described how they wished a tool like this had been available in previous office environments


Evaluating accuracy and speed of tope recommendation

• Prior study found that 32 categories covered 70% of columns that could be categorized in the EUSES spreadsheet corpus

• Evaluate accuracy & speed of tope recommendation
  – Create a tope in Toped++ for each data category
  – Randomly choose a subset of these topes
  – Randomly choose examples from a column
  – Grab keywords from the column header
  – Query for a tope: Is it right? How long does query take?
  – Repeat many times
  – Then vary # topes, # examples, keywords to measure impact on accuracy & speed
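The procedure above can be sketched as a small simulation loop in Python. Everything here is hypothetical scaffolding: the `evaluate` function, the shape of `columns`, the toy `by_header` recommender, and the assumption that the randomly chosen subset always contains the correct tope are illustrative choices, not details given in the slides.

```python
import random
import time
from typing import Callable, Dict, List, Sequence, Tuple

Recommender = Callable[[Sequence[str], Sequence[str], Sequence[str]], List[str]]


def evaluate(recommend: Recommender, columns: Dict[str, Tuple[str, List[str]]],
             n_topes: int, n_examples: int, use_keywords: bool,
             trials: int = 100, menu_size: int = 5, seed: int = 0) -> Tuple[float, float]:
    """Return (fraction of queries whose menu holds the right tope, mean seconds per query)."""
    rng = random.Random(seed)
    categories = sorted(columns)
    hits, elapsed = 0, 0.0
    for _ in range(trials):
        target = rng.choice(categories)
        others = [c for c in categories if c != target]
        # Assumption: the available subset always includes the correct tope.
        available = [target] + rng.sample(others, n_topes - 1)
        rng.shuffle(available)
        header, cells = columns[target]
        examples = rng.sample(cells, n_examples)
        keywords = header.lower().split() if use_keywords else []
        start = time.perf_counter()
        menu = recommend(available, keywords, examples)[:menu_size]
        elapsed += time.perf_counter() - start
        hits += target in menu
    return hits / trials, elapsed / trials


# Toy data: category name -> (column header, cells). Made up for the example.
columns = {
    "phone": ("Phone", ["808-202-3030", "412-268-1000"]),
    "state": ("State", ["PA", "OH"]),
    "zip": ("Zip", ["15213", "15217"]),
}


def by_header(available, keywords, examples):
    """Toy recommender that only looks at keywords; a real one would use the examples too."""
    return [name for name in available if name in keywords]


accuracy, seconds = evaluate(by_header, columns, n_topes=2, n_examples=1, use_keywords=True)
print(accuracy)
```

Varying `n_topes`, `n_examples`, and `use_keywords` across calls corresponds to the last step of the procedure.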


Recommendation accuracy: Even a short menu usually has right tope

[Chart: x-axis is # choices in the drop-down menu (result set size); one series per combination of # examples and whether keywords are used]


Recommendation speed: Menu can be populated in < 1 second

[Chart: x-axis is the number of topes on the computer to choose from; one series per combination of # examples and whether keywords are used]


Toped++: first system to integrate user-extensible string validation with executable reformatting rules

• Other tools described in Related Work:
  – Grammex & SWYN: No reformatting rules
  – Potluck & Lapis: No “replayable” reformatting rules
  – Nix edit-by-example: No validation

• RE-Trees: search-by-match for regular expressions

• Topes are essentially one way to model named entities, a central concept in information extraction research


Conclusion

• Contributions
  – Auto-generate reformatting rules
    • Very strongly preferred by users
    • Users quickly & correctly fix typos and reformat data
  – Recommend based on examples of strings to match
    • Good accuracy based on even just a few strings
    • Fast enough to search user’s computer as he works

• Future Opportunities
  – Improving accuracy of recommendations
    • Learn from user responses to previous recommendations
    • Provide repository for intra-organizational tope reuse
  – Further integrations
    • Adding reformatting-based Joins to DataSpaces?


Thank You…

• To Margaret Burnett, James Lin, Simone Stumpf, Weng-Keen Wong and others in the EUSES Consortium for feedback over the years on topes

• To NSF for funding

• To IUI 2009 for this opportunity to present


References

[ICSE’08] Topes data model: C. Scaffidi, B. Myers, and M. Shaw. Topes: Reusable Abstractions for Validating Data. International Conference on Software Engineering (ICSE 2008), Leipzig, Germany, May 2008, pp. 1-10.

[ISEUD’09] User evaluation of the early tool: C. Scaffidi, B. Myers, and M. Shaw. Fast, Accurate Creation of Data Validation Formats by End-User Developers. 2nd International Symposium on End-User Development (ISEUD 2009), March 2009, to appear.

[FSE’08] Use in web macros: A. Koesnandar, S. Elbaum, G. Rothermel, L. Hochstein, K. Thomasset, and C. Scaffidi. Using Assertions to Help End-User Programmers Create Dependable Web Macros. Proc. 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2008), Atlanta, GA, November 2008, pp. 124-134.

[ICEIS’07] Inferring new topes: C. Scaffidi. Unsupervised Inference of Data Formats in Human-Readable Notation. Proceedings of 9th International Conference on Enterprise Information Systems - HCI Volume (ICEIS 2007), Madeira, Portugal, June 2007, pp. 236-241.