Topes: Reusable Abstractions for Validating Data

Christopher Scaffidi

Brad Myers, Mary Shaw

Carnegie Mellon University

Even when lives are at stake, people still make typos.

Hurricane Katrina “Person Locator” Web site

Problem Topes Validation Conclusion

Data errors reduce the usefulness of data.

Wrong data category

Questionable input

Incorrect formatting

The website creators omitted input validation.

• Primary reason: rejecting obviously-wrong inputs would prevent collecting questionable data

– Eg: Would you accept a city with 1 letter?

This is the UI code for the web form where users entered data for this website. A RAD tool called CodeCharge Studio was used to create the UI.

This site was not alone in lacking input validation.

• Eg: Google Base web application
– 13 primary web forms
– Even numeric fields accept unreasonable inputs (such as a salary of “-45”)

• Eg: Spreadsheets
– 40% of cells are non-numeric, non-date textual data
– Commonly used to gather and organize textual data for reports

Validation of these short human-readable strings must support…

• Testing membership in a data category
– Categories based on standards (eg: email address)
– Categories lacking standards (eg: city name)

• Ambiguously defined categories
– Identify questionable values for double-checking

• Multiple formats
– Format consistency, post-validation

• Platform-independent implementation
– Reuse in webapps, spreadsheets, others

Limitations of existing approaches

• Types do not support questionable values

• Grammars do not, either, nor can they reformat

• Information extraction algorithms rely on grammatical cues that are absent during validation

• Cues, Forms/3, -calculus, Slate, pollution markers, etc, infer numerical constraints but not constraints on strings, nor are they platform-independent

New Approach: Topes

• A tope = a platform-independent abstraction describing how to recognize and transform strings in one category of data

• Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain

• Validating with topes improves
– Accuracy of validation
– Reusability of validation code
– Subsequent duplicate identification

A tope is a graph. Node = format, edge = transformation

Notional representation for a CMU room number tope…

Formal building name & room number: Elliot Dunlap Smith Hall 225

Colloquial building name & room number: Smith 225

Building abbreviation & room number: EDSH 225
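The graph view can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the format names (`formal`, `colloquial`, `abbrev`), the one-entry `BUILDINGS` table, the choice of which edges exist, and the helper names are all hypothetical and not taken from the authors' tools. Formats are nodes, each transformation is a directed edge, and a value is moved between formats by following a chain of edges.

```python
from collections import deque

# Hypothetical lookup table, limited to the one building shown on the slide.
BUILDINGS = {"EDSH": ("Elliot Dunlap Smith Hall", "Smith")}


def _swap(value, src, dst):
    """Rewrite the building part of `value` from format src to format dst."""
    name, room = value.rsplit(" ", 1)
    for abbr, (formal, colloquial) in BUILDINGS.items():
        forms = {"abbrev": abbr, "formal": formal, "colloquial": colloquial}
        if forms[src] == name:
            return f"{forms[dst]} {room}"
    raise ValueError(f"unknown building: {name!r}")


# Edges of the graph: (source format, target format) -> transformation.
# Which edges exist is an assumption made for this sketch.
EDGES = {
    ("formal", "abbrev"): lambda v: _swap(v, "formal", "abbrev"),
    ("abbrev", "colloquial"): lambda v: _swap(v, "abbrev", "colloquial"),
    ("colloquial", "formal"): lambda v: _swap(v, "colloquial", "formal"),
}


def transform(value, src, dst):
    """Breadth-first search for a chain of edges from src to dst, applying each."""
    queue = deque([(src, value)])
    seen = {src}
    while queue:
        fmt, current = queue.popleft()
        if fmt == dst:
            return current
        for (a, b), trf in EDGES.items():
            if a == fmt and b not in seen:
                seen.add(b)
                queue.append((b, trf(current)))
    raise ValueError(f"no path from {src} to {dst}")
```

For example, `transform("Elliot Dunlap Smith Hall 225", "formal", "colloquial")` has no direct edge in this sketch, so it passes through the abbreviation format on the way.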

A tope is a conceptual abstraction. A tope implementation is code.

• Each tope implementation has executable functions:
– 1 $\mathrm{isa}: \mathrm{string} \to [0,1]$ function per format, for recognizing instances of the format (a fuzzy set)
– 0 or more $\mathrm{trf}: \mathrm{string} \to \mathrm{string}$ functions linking formats, for transforming values from one format to another

• Validation function: $\phi(\mathrm{str}) = \max_f \mathrm{isa}_f(\mathrm{str})$, where $f$ ranges over the tope’s formats
– Valid when $\phi(\mathrm{str}) = 1$
– Invalid when $\phi(\mathrm{str}) = 0$
– Questionable when $0 < \phi(\mathrm{str}) < 1$
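As a minimal sketch of the validation function, assuming two hypothetical formats for the room number example: the regular expressions, the sets of known names, and the score of 0.5 for unrecognized names are illustrative choices, not values from the talk. Each `isa` returns a degree of membership in [0, 1], and the validation score is the maximum over the formats.

```python
import re


def isa_abbrev(s):
    """Score membership in an assumed 'ABBR 123' format."""
    m = re.fullmatch(r"([A-Z]{2,4}) (\d{3,4})", s)
    if not m:
        return 0.0
    # Assumed soft constraint: a known abbreviation is certain,
    # an unknown but well-formed one is only questionable.
    return 1.0 if m.group(1) in {"EDSH"} else 0.5


def isa_colloquial(s):
    """Score membership in an assumed 'Name 123' format."""
    m = re.fullmatch(r"([A-Z][a-z]+) (\d{3,4})", s)
    if not m:
        return 0.0
    return 1.0 if m.group(1) in {"Smith"} else 0.5


ISA_FUNCTIONS = [isa_abbrev, isa_colloquial]


def phi(s):
    """Validation score: the best isa score over all formats of the tope."""
    return max(isa(s) for isa in ISA_FUNCTIONS)


def classify(s):
    """Map the score to the three outcomes named on the slide."""
    score = phi(s)
    if score == 1:
        return "valid"
    if score == 0:
        return "invalid"
    return "questionable"
```

Taking the maximum means a string only has to match one format well to be accepted, which is what lets a single tope cover several formats.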

Common kinds of topes: enumerations and proper nouns

• Multi-format Enumerations, e.g.: US states
– “New York”, “CA”, maybe “Guam”

• Open-set proper nouns, e.g.: Company names
– Whitelist of definitely valid names (“Google”), with alternate formats (e.g. “Google Corp”, “GOOG”)
– Augmented with a pattern for promising inputs that are not yet on the whitelist
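The open-set proper noun idea might look like the following sketch. The whitelist contents come from the slide, but the `PROMISING` pattern, the 0.5 score, and the function names are assumptions made for illustration.

```python
import re

# Canonical name -> alternate formats (the examples given on the slide).
WHITELIST = {"Google": {"Google Corp", "GOOG"}}

# Assumed pattern for "promising" inputs: one or more capitalized words,
# optionally followed by a common company suffix.
PROMISING = re.compile(r"[A-Z][A-Za-z&]*( [A-Z][A-Za-z&]*)*( (Corp|Inc|LLC))?")


def isa_company(s):
    """1.0 if whitelisted, 0.5 (assumed) if merely promising, else 0.0."""
    known = set(WHITELIST) | {alt for alts in WHITELIST.values() for alt in alts}
    if s in known:
        return 1.0
    if PROMISING.fullmatch(s):
        return 0.5
    return 0.0


def to_canonical(s):
    """Transform a whitelisted alternate format to its canonical name."""
    for name, alts in WHITELIST.items():
        if s == name or s in alts:
            return name
    return s  # unknown names are left unchanged
```

A name that matches only the pattern is flagged as questionable rather than rejected, so new companies can still be entered and double-checked.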

Two more common kinds of topes: numeric and hierarchical

• Numeric, e.g.: human masses
– Numeric and in a certain range
– Values slightly outside range might be questionable
– (Very rarely) labeled with an explicit unit
– Transformation usually by multiplication

• Hierarchical, e.g.: address lines
– Parts described with other topes (e.g.: “100 Main St.” uses a numeric, a proper noun, and an enum)
– Simple isas can be implemented with regexps.
– Transformations involve permutation of parts, changes to separators, arithmetic, and lookup tables.
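A numeric tope for human masses could be sketched as below. The range limits, the 0.3 score for out-of-range values, and the function names are assumptions chosen for illustration; only the overall shape (range check, questionable margin, optional unit, transformation by multiplication) comes from the slide.

```python
import re

KG_PER_LB = 0.45359237  # definition of the avoirdupois pound in kilograms


def isa_mass_kg(s):
    """Score a string as a human mass in kilograms, with an optional unit."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)(?: ?kg)?", s.strip())
    if not m:
        return 0.0
    kg = float(m.group(1))
    if 2 <= kg <= 250:   # assumed "certain" range
        return 1.0
    if 0 < kg <= 650:    # assumed margin: unusual, so questionable
        return 0.3
    return 0.0


def trf_lb_to_kg(s):
    """Transform a value in the pounds format to the kilograms format."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?) ?lb", s.strip())
    if not m:
        raise ValueError(f"not in pounds format: {s!r}")
    return f"{float(m.group(1)) * KG_PER_LB:.1f} kg"
```

Because the transformation emits the kilograms format, its output can be fed straight back into `isa_mass_kg` for validation.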

Formal tool demonstration on Friday

Features:

• Format inference
• Format/part names
• Soft constraints
• Testing features
• Format reusability

Formal tool demonstration on Friday

Microsoft Excel: buttons and menus

Visual Studio: drag-and-drop code generation

Evaluating accuracy, reusability, and usefulness for data cleaning

• Implemented topes for spreadsheet data
– 32 topes based on 720 online spreadsheets
– Tested accuracy

• Reused topes on web application data
– 8 data categories in Google Base and 5 data categories in Hurricane Katrina site
– Tested accuracy

• Used transformations to reformat data
– 5 data categories in Hurricane Katrina site
– Measured increase in number of duplicates identified

Extracting spreadsheet test data

• Cluster spreadsheet columns based on data category
– EUSES spreadsheet corpus “database” section
– Hierarchical agglomerative clustering
– Manual inspection
– Result = 1713 columns in 246 clusters (1 cluster per data category)

• Created 1 tope for each of the 32 most common categories
– Yielding 32 topes
– Covered 70% of clustered columns
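The slide does not say which features or distance measure drove the clustering, so the following is only a generic sketch of hierarchical agglomerative clustering. The character-shape `signature`, the Jaccard distance, the single-linkage rule, and the stopping threshold are all assumptions introduced here, and columns are assumed to be non-empty.

```python
def signature(value):
    """Collapse a string to its character-class shape, e.g. '555-1234' -> '9-9'."""
    out = []
    for ch in value:
        cls = "A" if ch.isalpha() else "9" if ch.isdigit() else ch
        if not out or out[-1] != cls:
            out.append(cls)
    return "".join(out)


def distance(col_a, col_b):
    """Jaccard distance between the sets of shapes seen in two columns."""
    a = {signature(v) for v in col_a}
    b = {signature(v) for v in col_b}
    return 1 - len(a & b) / len(a | b)


def agglomerate(columns, threshold):
    """Repeatedly merge the two closest clusters until none is within threshold."""
    clusters = [[name] for name in columns]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(distance(columns[x], columns[y])
                        for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters
```

In the study the clusters were also inspected manually, a step that a sketch like this does not replace.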

We considered 5 validation strategies

• Strategy 1: Current spreadsheet practice (accept all inputs)

• Strategy 2: Current webapp practice (validate with regexp or fixed list, when available; accept all other inputs)
– 36 regexps + 35 fixed lists, in 7 categories

• Strategy 3A: Tope rejecting questionable (accept when $\phi(\mathrm{str}) = 1$)

• Strategy 3B: Tope accepting questionable (accept when $\phi(\mathrm{str}) > 0$)

• Strategy 4: Tope warn on questionable (simulate double-check by user when $0 < \phi(\mathrm{str}) < 1$)
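The tope-based strategies differ only in how they treat the validation score, which a short sketch makes concrete. The function name `accept`, the string labels, and the `ask_user` callback are hypothetical. Strategy 2 is left out because it depends on per-category regexps and lists rather than on the score.

```python
def accept(strategy, score, ask_user=None):
    """Decide whether to accept an input, given its validation score in [0, 1]."""
    if strategy == "1":
        return True                    # accept all inputs
    if strategy == "3A":
        return score == 1              # reject questionable inputs
    if strategy == "3B":
        return score > 0               # accept questionable inputs
    if strategy == "4":
        if 0 < score < 1:
            return bool(ask_user())    # user double-checks questionable inputs
        return score == 1
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Under Strategy 4 the callback is consulted only for questionable inputs, so clearly valid and clearly invalid inputs never interrupt the user.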

Measurements

• Based on 100 random values per category

• Used F1 to measure accuracy
– Standard measure of accuracy for classifiers: $F_1 = \dfrac{\text{precision} \cdot \text{recall}}{\operatorname{avg}(\text{precision}, \text{recall})}$

• Considered topes with 1, 2, 3, 4, or 5 formats
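The F1 formula can be computed as in the sketch below, which treats "invalid input" as the positive class that a validator is trying to flag. The function and parameter names are hypothetical. Dividing by the average of precision and recall is algebraically the same as the more familiar form 2 · precision · recall / (precision + recall).

```python
def f1_score(flagged, invalid):
    """F1 for a validator, where the positive class is 'invalid input'.

    flagged[i] is True if the validator rejected input i;
    invalid[i] is True if input i really was invalid.
    """
    pairs = list(zip(flagged, invalid))
    tp = sum(1 for f, i in pairs if f and i)        # correctly rejected
    fp = sum(1 for f, i in pairs if f and not i)    # wrongly rejected
    fn = sum(1 for f, i in pairs if not f and i)    # invalid but accepted
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (precision * recall) / ((precision + recall) / 2)
```

A validator that accepts every input never flags anything, so its recall on invalid inputs is 0 and this function returns 0.0.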

Recognizing multiple formats and questionable inputs raises accuracy

Condition 4: Hypothetical user has to help on ~ 3% of inputs

Condition 1: Recall = 0 (fails to identify any invalid inputs)

Topes based on spreadsheet data were accurate on web application data.

[Figure panels: Hurricane Katrina, Google Base]

Putting data in a consistent format improves duplicate identification.

• Randomly extracted 10000 values for each of 5 Hurricane Katrina data categories

• For each 5-format tope, implemented transformations from the less commonly used formats to the most commonly used format

• Found approximately 8% more duplicates after transformation
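The effect of reformatting on duplicate detection can be illustrated with a toy sketch. The sample values, the `TO_FULL_NAME` table, and the function names are invented for illustration and are not data from the study. Values are first transformed to one common format, then exact-match duplicates are counted.

```python
from collections import Counter

# Hypothetical transformation: state abbreviation -> full name.
TO_FULL_NAME = {"LA": "Louisiana", "TX": "Texas"}


def trf_to_common(value):
    """Rewrite a value into the assumed most common format (full state name)."""
    return TO_FULL_NAME.get(value, value)


def count_duplicates(values):
    """Number of values that exactly repeat some earlier value."""
    return sum(n - 1 for n in Counter(values).values())


def duplicates_before_and_after(values):
    """Count duplicates in the raw values and again after reformatting."""
    before = count_duplicates(values)
    after = count_duplicates([trf_to_common(v) for v in values])
    return before, after
```

With the made-up input `["Louisiana", "LA", "LA", "Texas", "TX"]`, only the repeated "LA" is found before reformatting, while all five values collapse to two distinct states afterwards.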

Topes improve data validation

• Validating with topes improves
– Accuracy of validation
– Reusability of validation code
– Subsequent duplicate identification

• Contributions:
– Support for ambiguous data categories
– Support for transforming values
– Platform-independent validation

Future Work: Sharing topes

• Repository search mechanisms based on
– Relevance to new applications
– Quality criteria

• Integrate with more programming platforms
– Microsoft Excel
– Microsoft Visual Studio.NET
– A simple XML processing API
– Univ. Nebraska’s Robofox
– IBM’s CoScripter
– Your tool or platform?

Thank You…

• To Jeff Magee, Betty Cheng, Barbara Ryder, Margaret Burnett, and others at ICSE 2007 for early feedback

• To NSF for funding

• To ICSE 2008 for this opportunity to present
