Intelligently Creating and Recommending
Reusable Reformatting Rules
Christopher Scaffidi, Brad Myers, Mary Shaw
Carnegie Mellon University
People often use spreadsheets to store and organize “string” data
According to a study by Univ. Nebraska, nearly 40% of spreadsheet cells are strings (i.e., not numbers, formulas, or dates)
Example task found while observing administrative assistants (contextual inquiry)…
• Build a roster of employee contact info
– Visit several project teams’ web sites
– Copy data from web sites into spreadsheet
– Manually put data into consistent format
(because users care about formatting when creating reports)
Introduction Editor Recommendation Evaluation
Mishmash of formats and invalid strings
- illustrative example (not actually the spreadsheet in the contextual inquiry)
- part of an actual spreadsheet from CMU web site
Needed: automated support for validating and reformatting domain-specific strings
• Finding and fixing strings is tedious and error-prone
• Excel and other tools provide no features for automatically reformatting domain-specific strings
– Only for numeric data & a few specific kinds of strings (not domain-extensible)
Underlying problem: abstraction mismatch
• Tools support strings, ints, floats, sometimes dates.
• Problem domain involves higher-level, multi-format categories of strings:
– Person names
– CMU department names
– CMU course numbers
– CMU building room numbers
Tope: Each tope describes how to validate and reformat one kind of string
A notional depiction of a tope for CMU room numbers…
Node = format, edge = reformatting rule
– Formal building name & room number: Elliot Dunlap Smith Hall 225
– Colloquial building name & room number: Smith 225
– Building abbreviation & room number: EDSH 225
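The depiction above (node = format, edge = reformatting rule) can be sketched as a small graph in Python. This is an illustrative assumption, not the Toped++ data model: the format names, the regex-based validators and the `BUILDINGS` lookup table are hypothetical, and only the three example strings come from the slide.

```python
import re

# Hypothetical lookup table: one row per building, one column per format.
BUILDINGS = [
    {"formal": "Elliot Dunlap Smith Hall", "colloquial": "Smith", "abbrev": "EDSH"},
]
FORMATS = ["formal", "colloquial", "abbrev"]

def make_validator(fmt):
    """Return a function that tests whether a string is in the given format."""
    names = "|".join(re.escape(row[fmt]) for row in BUILDINGS)
    pattern = re.compile(rf"({names}) (\d+)")
    return lambda s: pattern.fullmatch(s) is not None

def make_rule(src, dst):
    """Return a reformatting rule (a graph edge) from format src to format dst."""
    def rule(s):
        name, room = s.rsplit(" ", 1)
        for row in BUILDINGS:
            if row[src] == name:
                return f"{row[dst]} {room}"
        raise ValueError(f"{s!r} is not in format {src!r}")
    return rule

# Nodes: format name -> validator. Edges: (src, dst) -> reformatting rule.
nodes = {f: make_validator(f) for f in FORMATS}
edges = {(a, b): make_rule(a, b) for a in FORMATS for b in FORMATS if a != b}

def reformat(s, dst):
    """Find the format that s is in, then follow the edge to dst."""
    for src, is_valid in nodes.items():
        if is_valid(s):
            return s if src == dst else edges[(src, dst)](s)
    raise ValueError(f"{s!r} matches no format of this tope")

print(reformat("Smith 225", "abbrev"))
```

A string that matches none of the nodes is rejected, which is how the same structure covers validation as well as reformatting.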
What’s new and interesting today? Auto-reformatting and recommendation
• Previous work:
– Early tope editing tool for creating topes to validate and reformat spreadsheet, web form and web macro data [ICSE’08, FSE’08]
– Inferring new topes from example strings [ICEIS’07]
– Usability evaluation of the early tope editing tool [ISEUD’09]
• Limitations of previous work:
– Tedious to implement reformatting rules
– Tedious to reuse topes
• Contributions today:
– Automatic reformatting
– Tope recommendation
New “Format As” feature
Today’s presentation
• Introduction
– Problem overview
– Topes overview
• Tope editing tool: Toped++
• Tope recommendation
• Evaluation
– Evaluation of usability
– Evaluation of accuracy & speed
• Conclusion
Creating a new tope
Highlight cells containing example strings… system infers a boilerplate tope
Data Description Editor
Toped++: an improved editor for topes
Whitelist tab
Other kinds of data easily described with a whitelist:
• US state names & abbreviations
• Campus building names & abbreviations
Auto-reformatting
• Topes with a single word-like part
– 4 formats: UPPER CASE, lower case, Title Case, miXeD cAse
• Topes with a single numeric part
– One format per # digits allowed: pad with “0” and/or round
• Topes with multiple parts and separators
– (Recursively) reformat each part, concatenate with separators
• Topes that also have a whitelist
– One format per synonym column: use lookup table
• Important: after reformatting, test the resulting string against the target format’s grammar to detect errors.
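The four cases above, plus the final check, can be sketched in Python as follows. This is a rough sketch under stated assumptions, not the Toped++ implementation: all function names are hypothetical, “# digits” is read as digits after the decimal point, “miXeD cAse” is treated as “leave the casing unchanged”, and the target grammar is stood in for by a regular expression.

```python
import re

def reformat_word(s, case):
    """Single word-like part: 'upper', 'lower', 'title', or 'mixed'."""
    # Assumption: 'mixed' accepts any casing, so the string is left unchanged.
    return {"upper": s.upper(), "lower": s.lower(),
            "title": s.title(), "mixed": s}[case]

def reformat_number(s, decimals):
    """Single numeric part: round and/or pad with '0' to a fixed # of decimals."""
    return f"{float(s):.{decimals}f}"

def reformat_whitelist(s, table, column):
    """Whitelist part: find s in any synonym column, emit the target column."""
    for row in table:
        if s in row.values():
            return row[column]
    raise ValueError(f"{s!r} is not on the whitelist")

def reformat_parts(parts, part_rules, separators):
    """Multi-part tope: reformat each part, then concatenate with separators."""
    out = [rule(p) for rule, p in zip(part_rules, parts)]
    result = out[0]
    for sep, piece in zip(separators, out[1:]):
        result += sep + piece
    return result

def checked(result, target_grammar):
    """Final step: test the result against the target format to detect errors."""
    if not re.fullmatch(target_grammar, result):
        raise ValueError(f"{result!r} does not match the target format")
    return result

STATES = [{"name": "Pennsylvania", "abbrev": "PA"}]  # hypothetical whitelist
city_state = reformat_parts(
    ["pittsburgh", "Pennsylvania"],
    [lambda p: reformat_word(p, "title"),
     lambda p: reformat_whitelist(p, STATES, "abbrev")],
    [", "],
)
print(checked(city_state, r"[A-Z][a-z]+, [A-Z]{2}"))
```

The `checked` step matters because a rule can run without raising and still yield a string that the target format would reject.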
Supporting reuse: Recommendation via search-by-match algorithm
Algorithm summary:
1. Sort topes by # keywords hit
2. Break ties by testing examples against whitelists
3. Break remaining ties by testing examples against the rest of the tope
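The three-step ordering in the algorithm summary can be expressed as a single composite sort key. The sketch below is only an illustration: the `Tope` class, its fields and the sample topes are hypothetical, and it computes all three scores eagerly, whereas a real implementation would presumably run the later tests only when ties remain.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tope:
    name: str
    matches: Callable[[str], bool]          # full validation test
    keywords: set = field(default_factory=set)
    whitelist: set = field(default_factory=set)

def recommend(topes, keywords, examples):
    """Order topes by keyword hits, then whitelist hits, then full matches."""
    def key(t):
        keyword_hits = len(t.keywords & set(keywords))
        whitelist_hits = sum(e in t.whitelist for e in examples)
        full_hits = sum(t.matches(e) for e in examples)
        # Negate so that more hits sort first; tuple order gives the tie-breaks.
        return (-keyword_hits, -whitelist_hits, -full_hits)
    return sorted(topes, key=key)

def regex_test(pattern):
    return lambda s: re.fullmatch(pattern, s) is not None

topes = [
    Tope("zip", regex_test(r"\d{5}"), keywords={"zip"}),
    Tope("state", regex_test(r"[A-Z]{2}"), keywords={"state"},
         whitelist={"PA", "Pennsylvania"}),
    Tope("phone", regex_test(r"\d{3}[-.]\d{3}[-.]\d{4}"), keywords={"phone"}),
]

print([t.name for t in recommend(topes, ["state"], ["PA"])])
print([t.name for t in recommend(topes, [], ["808-202-3030"])])
```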
Implementation details: Speeding up the recommendation
• Counting keyword hits and whitelist hits is easy
– just use an inverted index.
• But testing every example on every tope is wasteful
• Why test a tope if it couldn’t match anyway?
• For example, if a phone number tope can only match formats like “808-202-3030” and “808.202.3030”, then it only needs to be tested against examples that have 10 digits and 2 hyphens or periods.
– Index topes according to their “character content”
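One way to picture the “character content” index is as a map from a coarse signature of a string to the topes that could possibly match it. The sketch below is an assumption about how such an index might look, not the actual data structure: `signature` and `CharContentIndex` are hypothetical names, and it assumes each format has fixed character content, which formats with variable-length parts would not.

```python
from collections import defaultdict

def signature(s):
    """Summarize character content: # digits, any letters?, sorted other chars."""
    digits = sum(c.isdigit() for c in s)
    has_letters = any(c.isalpha() for c in s)
    others = tuple(sorted(c for c in s if not c.isalnum()))
    return (digits, has_letters, others)

class CharContentIndex:
    def __init__(self):
        self._by_signature = defaultdict(set)

    def add(self, tope_name, representative_strings):
        """Register a tope under the signature of one string per format."""
        for s in representative_strings:
            self._by_signature[signature(s)].add(tope_name)

    def candidates(self, example):
        """Return only the topes whose character content fits the example."""
        return self._by_signature.get(signature(example), set())

index = CharContentIndex()
index.add("phone", ["808-202-3030", "808.202.3030"])
index.add("zip", ["15213"])

print(index.candidates("412-268-1234"))  # same content as a phone format
print(index.candidates("4122681234"))    # 10 digits but no separators
```

Only the topes returned by `candidates` then need the full (and slower) validation test.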
Evaluating usability for fixing spreadsheet data
• 9 master’s students, primarily in business
• Baseline: fixing strings manually
• Within-subject study design with 4 phases:
– Tutorial task (up to 30 minutes)
– Three tasks using Toped++ (up to 30 minutes total)
  • Each task: using Toped++ to fix typos and reformat 100 cells
– Same three tasks manually (up to 1 minute each)
– Satisfaction questionnaire
Task details
• Each task = Find and fix typos in 100 spreadsheet cells, then put the cells into a specified format
– E.g.: add “.com” to email addresses lacking top-level domain, then reformat like “[email protected]”
• Different kinds of data assigned to different users:
– 3 users: Person first name, last name, university (single-part word-like topes)
– 3 users: Course number, state name, country name (whitelist-driven topes; we provided whitelists from web)
– 3 users: Email address, phone number, person name (multi-part topes)
Usability: Improves user speed with negligible errors
                               Minutes required                        Breakeven point
                               Toped++ (actual)   Manual (projected)   (# cells)
Group 1: Single word data      3.0                5.0                  60
Group 2: Whitelist data        6.9                15.9                 43
Group 3: Multi-part data       3.6                10.2                 35
Overall Average:               4.5                9.6                  47

• With ~1/1000 error rate
• Manual times are projected, based on how many seconds participants spent fixing typos & reformatting each cell
• Even without reuse!
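The slide does not say how the breakeven points were computed. One reading that reproduces all four values is sketched below, and it is an assumption: the Toped++ time is treated as a fixed cost, the manual time is assumed to grow linearly across the 100 cells of a task, and the breakeven point is the number of cells at which the two costs are equal.

```python
CELLS_PER_TASK = 100

# (Toped++ minutes, projected manual minutes) per task, copied from the table.
RESULTS = {
    "Group 1: Single word data": (3.0, 5.0),
    "Group 2: Whitelist data": (6.9, 15.9),
    "Group 3: Multi-part data": (3.6, 10.2),
    "Overall Average": (4.5, 9.6),
}

def breakeven_cells(toped_minutes, manual_minutes, cells=CELLS_PER_TASK):
    """Cells at which manual work costs as much as the (assumed fixed) Toped++ time."""
    manual_minutes_per_cell = manual_minutes / cells
    return round(toped_minutes / manual_minutes_per_cell)

breakevens = {name: breakeven_cells(t, m) for name, (t, m) in RESULTS.items()}
for name, cells in breakevens.items():
    print(f"{name}: {cells} cells")
```

For Group 1, for instance, manual work costs 5.0 / 100 = 0.05 minutes per cell, so 3.0 minutes of Toped++ use pays off at 3.0 / 0.05 = 60 cells.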
User satisfaction: They want to use topes
• User preference: Toped++ or doing tasks manually
– Every user strongly preferred Toped++
• 5-point Likert scales asking…
– How easy Toped++ was to use
– How much users trusted it
– How pleasant it was to use
– If they would use it if made available
– Every participant but one gave a score of 4 or 5 on every question (the good end of the scale)
• Two users described how they wished a tool like this had been available in previous office environments
Evaluating accuracy and speed of tope recommendation
• Prior study found that 32 categories covered 70% of columns that could be categorized in the EUSES spreadsheet corpus
• Evaluate accuracy & speed of tope recommendation
– Create a tope in Toped++ for each data category
– Randomly choose a subset of these topes
– Randomly choose examples from a column
– Grab keywords from the column header
– Query for a tope: Is it right? How long does the query take?
– Repeat many times
– Then vary # topes, # examples, keywords to measure impact on accuracy & speed
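The procedure above maps onto a simple simulation loop. The harness below is a hypothetical sketch, not the evaluation code: `evaluate`, the column dictionaries and the toy recommender are all invented for illustration, and it assumes the correct tope is always kept in the random subset so that each query has a right answer.

```python
import random
import time

def evaluate(recommend, topes, columns, n_topes, n_examples, use_keywords,
             menu_size, trials, seed=0):
    """Return (fraction of trials with the right tope in the menu, mean seconds)."""
    rng = random.Random(seed)
    hits, elapsed = 0, 0.0
    for _ in range(trials):
        column = rng.choice(columns)
        # Assumption: the correct tope is always part of the random subset.
        others = [t for t in topes if t != column["tope"]]
        subset = rng.sample(others, n_topes - 1) + [column["tope"]]
        examples = rng.sample(column["cells"], n_examples)
        keywords = column["header"].lower().split() if use_keywords else []
        start = time.perf_counter()
        menu = recommend(subset, keywords, examples)[:menu_size]
        elapsed += time.perf_counter() - start
        hits += column["tope"] in menu
    return hits / trials, elapsed / trials

# Toy data and a toy recommender that ranks by keyword match only.
TOPES = ["phone", "state", "zip"]
COLUMNS = [
    {"header": "Phone", "cells": ["808-202-3030", "808.202.3030"], "tope": "phone"},
    {"header": "State", "cells": ["PA", "OH", "NY"], "tope": "state"},
]

def toy_recommend(subset, keywords, examples):
    return sorted(subset, key=lambda t: t not in keywords)

accuracy, seconds = evaluate(toy_recommend, TOPES, COLUMNS, n_topes=2,
                             n_examples=2, use_keywords=True, menu_size=1,
                             trials=20)
print(accuracy, seconds)
```

Varying `n_topes`, `n_examples`, `use_keywords` and `menu_size` across calls corresponds to the last step of the procedure.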
Recommendation accuracy: Even a short menu usually has the right tope
[Chart. Axis: # choices in the drop-down menu (result set size). Series: # examples; use keywords?]
Recommendation speed: Menu can be populated in < 1 second
[Chart. Axis: number of topes on the computer to choose from. Series: # examples; use keywords?]
Toped++: first system to integrate user-extensible string validation with executable reformatting rules
• Other tools described in Related Work:
– Grammex & SWYN: No reformatting rules
– Potluck & Lapis: No “replayable” reformatting rules
– Nix edit-by-example: No validation
• RE-Trees: search-by-match for regular expressions
• Topes is basically one way to model named entities, a central concept in information extraction research
Conclusion
• Contributions
– Auto-generate reformatting rules
  • Very strongly preferred by users
  • Users quickly & correctly fix typos and reformat data
– Recommend based on examples of strings to match
  • Good accuracy based on even just a few strings
  • Fast enough to search user’s computer as he works
• Future Opportunities
– Improving accuracy of recommendations
  • Learn from user responses to previous recommendations
  • Provide repository for intra-organizational tope reuse
– Further integrations
  • Adding reformatting-based Joins to DataSpaces?
Thank You…
• To Margaret Burnett, James Lin, Simone Stumpf, Weng-Keen Wong and others in the EUSES Consortium for feedback over the years on topes
• To NSF for funding
• To IUI 2009 for this opportunity to present
References
[ICSE’08] Topes data model
C. Scaffidi, B. Myers, and M. Shaw. Topes: Reusable Abstractions for Validating Data. International Conference on Software Engineering (ICSE 2008), Leipzig, Germany, May 2008, pp. 1-10.
[ISEUD’09] User eval early tool
C. Scaffidi, B. Myers, and M. Shaw. Fast, Accurate Creation of Data Validation Formats by End-User Developers. 2nd International Symposium on End-User Development (ISEUD 2009), March 2009, to appear.
[FSE’08] Use in web macros
A. Koesnandar, S. Elbaum, G. Rothermel, L. Hochstein, K. Thomasset, and C. Scaffidi. Using Assertions to Help End-User Programmers Create Dependable Web Macros. Proc. 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2008), Atlanta, GA, November 2008, pp. 124-134.
[ICEIS’07] Inferring new topes
C. Scaffidi. Unsupervised Inference of Data Formats in Human-Readable Notation. Proceedings of 9th International Conference on Enterprise Information Systems - HCI Volume (ICEIS 2007), Madeira, Portugal, June 2007, pp. 236-241.