Spreadsheets: The Good Parts

Managing Spreadsheets. Michael Cafarella, Zhe Shirley Chen, Jun Chen, Junfeng Zhang, Dan Prevo. University of Michigan. New England Database Summit, February 1, 2013.


Transcript of Spreadsheets: The Good Parts

  • Managing Spreadsheets

    Michael Cafarella, Zhe Shirley Chen, Jun Chen, Junfeng Zhang, Dan Prevo. University of Michigan. New England Database Summit, February 1, 2013.

  • Spreadsheets: The Good Parts

    A "Swiss Army Knife" for data: storing, sharing, transforming. Sophisticated users who are not DBAs. They contain lots of data, found nowhere else. Everyone uses them; they are almost wholly ignored by the DB community. Thanks, Jeremy!

  • Spreadsheets: The Awful Parts

    Users toss in data and worry about schemas later (well, never). Spreadsheets are designed for humans, not query processors. With no explicit schemas, data integrity is poor (Zeeberg et al, 2004) and integration is very hard.

    Tumor suppressor gene Deleted In Esophageal Cancer 1, aka DEC1, aka (according to Excel) 01-DEC


  • A Data Tragedy

    Spreadsheets build, then entomb, our best, most expensive data. >400,000 spreadsheets come just from ClueWeb09, from governments, the WTO, and many other sources. How many sit inside the firewall? Application vision: ad-hoc integration & analysis for any dataset. Challenge: recover relations from any spreadsheet, with little human effort.

  • Closeup

    Desired tuple:

  • Agenda

    Spreadsheets: An Overview; Extracting Data; Hierarchy Extraction; Manual Repairs; Experimental Results; Demo; Related and Future Work


  • Extracting Tuples

    Extract the frame and the attribute hierarchy trees, map values to attributes to create tuples, then apply manual repairs and repeat. How many repairs are needed for 100% accuracy? This yields tuples, not relations; we won't discuss relation assembly. A sketch of the value-to-tuple step follows.
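
    As a concrete illustration of the mapping step, here is a minimal, hypothetical sketch (not the authors' code): once the TOP and LEFT attribute hierarchies are known, each data cell becomes one tuple by pairing its value with the root-to-leaf attribute paths of its row and column. All names and numbers below are invented for the example.

```python
from typing import Dict, List, Tuple

def cell_tuples(
    data: Dict[Tuple[int, int], float],   # (row, col) -> numeric value
    left_path: Dict[int, List[str]],      # row -> root-to-leaf LEFT attributes
    top_path: Dict[int, List[str]],       # col -> root-to-leaf TOP attributes
) -> List[tuple]:
    """Map every data cell to one relational tuple."""
    tuples = []
    for (row, col), value in sorted(data.items()):
        # A tuple = LEFT attribute path + TOP attribute path + the value itself.
        tuples.append(tuple(left_path[row]) + tuple(top_path[col]) + (value,))
    return tuples

# Toy cell from a smoking-rate sheet with a hierarchical LEFT region.
print(cell_tuples(data={(5, 2): 51.1},
                  left_path={5: ["Male", "18 to 24 years"]},
                  top_path={2: ["1965"]}))
# [('Male', '18 to 24 years', '1965', 51.1)]
```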

  • 1. Frame Detection

    Key assumption: inputs are data frames. Locate metadata in the top/left regions and data in the center block.

  • Closeup

  • 1. Frame Detection

    Key assumption: inputs are data frames. Locate metadata in the top/left regions and data in the center block. ~72% of spreadsheets fit this model; the others are not relational. Each non-empty row is labeled one of TITLE, HEADER, DATA, or FOOTNOTE, and the regions are reconstructed from the labels. Labels are inferred with a linear-chain Conditional Random Field (Lafferty et al, 2001). Layout features: has a bold cell? A merged cell? Text features: contains "table" or "total"? Indented text? Numeric cells? Year cells? A sketch of such a row labeler follows.
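
    A minimal sketch of this kind of row labeler, assuming the third-party sklearn-crfsuite package rather than the authors' own implementation; the feature set and the toy training sheet are illustrative guesses.

```python
import sklearn_crfsuite

def row_features(rows):
    """One feature dict per non-empty row, top to bottom."""
    return [{
        "has_bold_cell": r["bold"],        # layout features
        "has_merged_cell": r["merged"],
        "kw_table_or_total": r["kw"],      # text features
        "indented": r["indented"],
        "frac_numeric_cells": r["numeric"],
        "has_year_cell": r["year"],
    } for r in rows]

# Toy sheet: a title row, a header row, two data rows, a footnote row.
sheet = [
    {"bold": True,  "merged": True,  "kw": True,  "indented": False, "numeric": 0.0, "year": False},
    {"bold": True,  "merged": False, "kw": False, "indented": False, "numeric": 0.2, "year": True},
    {"bold": False, "merged": False, "kw": False, "indented": True,  "numeric": 0.9, "year": False},
    {"bold": False, "merged": False, "kw": False, "indented": True,  "numeric": 0.9, "year": False},
    {"bold": False, "merged": False, "kw": False, "indented": False, "numeric": 0.0, "year": False},
]
labels = ["TITLE", "HEADER", "DATA", "DATA", "FOOTNOTE"]

# Linear-chain CRF: one label per row, trained here on a single toy sequence.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit([row_features(sheet)], [labels])
print(crf.predict([row_features(sheet)])[0])
```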

  • 2. Hierarchy Extraction

  • Closeup


  • 2. Hierarchy Extraction

    One task for TOP, one for LEFT. Create a boolean random variable for each candidate parent relationship, then build a conditional random field to obtain the best variable assignment.

  • 2. Hierarchy Extraction

  • 2. Hierarchy Extraction

    CRFs use potential functions to incorporate features. Node potentials represent a single parent/child match: do the cells share a style? Are they near each other? Whitespace-separated? Edge potentials tie pairs of parent/child decisions: share style pairs? Share text? Indented similarly? Spreadsheet potentials ensure a legal tree: a one-parent potential applies a negative weight for multiple parents, and a directional potential applies a negative weight when parent edges go in opposite directions. Run Loopy Belief Propagation for the node and edge potentials; test and repair for the spreadsheet potentials post-inference. Real sheets yielded 1K-8K variables. The sketch below makes the potential structure concrete.
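
    A minimal, self-contained sketch of the potential structure (hypothetical, not the authors' implementation): boolean variables for candidate parent edges, node and edge potentials, and a one-parent penalty, maximized by brute force on a toy instance. The real system uses Loopy Belief Propagation because sheets yield thousands of variables; all feature values and weights here are made up.

```python
from itertools import combinations, product

# Toy candidate parent edges (child, parent) for LEFT labels; numbers made up.
cands = [("18 to 24 yrs", "Male"),
         ("18 to 24 yrs", "Total"),
         ("25 to 34 yrs", "Male")]
node_feat = {cands[0]: 2.0, cands[1]: 0.5, cands[2]: 2.0}  # style/proximity evidence
similar = {(cands[0], cands[2])}  # pairs of edges that look stylistically alike

def score(assign):
    """Total potential of one boolean assignment (edge -> on/off)."""
    s = 0.0
    for e, on in assign.items():              # node potentials
        if on:
            s += node_feat[e]
    for e1, e2 in combinations(cands, 2):     # edge potentials: similar edges agree
        if (e1, e2) in similar and assign[e1] == assign[e2]:
            s += 1.0
    for child in {c for c, _ in cands}:       # one-parent potential: penalize 2+ parents
        if sum(assign[e] for e in cands if e[0] == child) > 1:
            s -= 10.0
    return s

# Brute force works on this toy instance; real sheets (1K-8K variables) need LBP.
best = max((dict(zip(cands, bits))
            for bits in product([False, True], repeat=len(cands))),
           key=score)
print([e for e, on in best.items() if on])
# [('18 to 24 yrs', 'Male'), ('25 to 34 yrs', 'Male')]
```
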
  • 3. Manual Repair

    The user reviews and repairs the extraction; the goal is to reduce user burden. The extractor makes repeated mistakes, either within a spreadsheet or across the corpus, and repeating fixes is a headache for the user. Our solution: after each repair, add repair potentials to the CRF. A repair potential links the user-repaired node to a set of nodes throughout the CRF, incorporating information on node similarity; the links are generated heuristically. After each repair, re-run inference. A continuation of the sketch above follows.
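
    Continuing the toy scorer above (again hypothetical, not the authors' code): a user repair clamps the repaired variable, heuristically links it to similar edges, and adds repair potentials that pull those edges toward the user's decision; inference is then simply re-run.

```python
repairs = {}       # edge -> bool, as fixed by the user
repair_links = []  # (repaired edge, similar edge) pairs, generated heuristically

def looks_similar(a, b):
    return a[1] == b[1]  # toy heuristic: candidate edges sharing a parent label

def apply_repair(edge, value):
    repairs[edge] = value
    repair_links.extend((edge, e) for e in cands
                        if e != edge and looks_similar(edge, e))

def score_with_repairs(assign):
    if any(assign[e] != v for e, v in repairs.items()):
        return float("-inf")              # repaired variables are clamped
    s = score(assign)
    for fixed, other in repair_links:     # repair potential: reward agreement
        if assign[other] == repairs[fixed]:
            s += 2.0
    return s

# The user turns OFF the ("18 to 24 yrs", "Male") edge; re-running inference
# also flips the similar ("25 to 34 yrs", "Male") edge, fixing it for free.
apply_repair(cands[0], False)
best = max((dict(zip(cands, bits))
            for bits in product([False, True], repeat=len(cands))),
           key=score_with_repairs)
print([e for e, on in best.items() if on])  # [('18 to 24 yrs', 'Total')]
```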

  • Agenda

    Spreadsheets: An Overview; Extracting Data; Hierarchy Extraction; Manual Repairs; Experimental Results; Demo; Related and Future Work

  • Experiments

    General survey of spreadsheet use. Evaluate: standalone extraction accuracy; manual repair effectiveness.

    Test sets:
    SAUS: 1,322 files from the 2010 Statistical Abstract of the United States
    WEB: 410,554 files from 51,252 domains, crawled from ClueWeb09

  • Spreadsheets in the Wild

    Very common for Web-published government data

    Domain                 # files   % total
    bts.gov                 12,435     3.03%
    census.gov               7,862     1.91%
    stat.co.jp               6,633     1.62%
    bankofengland.co.uk      5,520     1.34%
    ers.usda.gov             4,328     1.05%
    agr.gc.ca                4,186     1.02%
    wto.org                  3,863     0.94%
    doh.wa.gov               3,579     0.87%
    nsf.gov                  2,770     0.67%
    nces.ed.gov              2,177     0.53%

  • Spreadsheets in the Wild

  • Standalone Extraction

    100 random H-Sheets from SAUS and WEB. Three metrics: Pairs, parent/child pairs labeled correctly (F1); Tuples, relational tuples labeled correctly (F1); Sheets, % of sheets labeled 100% correctly. Two methods: Baseline uses just formatting and position; Hierarchy uses our approach.

  • Standalone Extraction

  • Manual Repair: Effectiveness

    Gather 10 topic areas from SAUS and WEB. An expert provides ground-truth hierarchies. Extract, then repeatedly repair and recompute.

  • Manual Repair: Ordering

    Good ordering: errors steadily decrease. Bad: extended periods of slow decrease.

  • End-To-End Extraction

    What is the overall utility of our extractor? Final metric: correct tuples per manual repair, i.e., # Tuples divided by # Repairs (e.g., 530.76 / 2.06 ≈ 257.65 for SAUS R50).

                 # Tuples   # Errors   # Repairs   Tuples/Repair
    SAUS R50       530.76       5.46        2.06          257.65
    SAUS Arts      454.8       25.4        13.1            34.72
    SAUS Fin.      266.1       29.9        13.5            19.71
    WEB R50        520.28      11.38        3.84          135.49
    WEB BTS         65.6        2.7         1              65.6
    WEB USDA       350.3        6.8         1.7           206.06

  • Agenda

    Spreadsheets: An Overview; Extracting Data; Hierarchy Extraction; Manual Repairs; Experimental Results; Demo; Related and Future Work

  • Demo Details

    Ran the SAUS corpus through the extractor. A simple ad hoc integration analysis tool sits on top of the extracted data, with early versions of relation reconstruction, data ranking, and join finding.

  • Related Work

    Spreadsheet as interface: (Witkowski et al, 2003), (Liu et al, 2009). Spreadsheet extraction with user-provided rules: (Ahmad et al, 2003), (Hung et al, 2011); with no explicit user rules: (Abraham and Erwig, 2007), (Cunha et al, 2009). Ad hoc integration for found data: (Cafarella et al, 2009), (Pimplikar and Sarawagi, 2012), (Yakout et al, 2012). Semi-automatic data programming: Wrangler (Guo et al, 2011).

  • Conclusions and Future Work

    Spreadsheet extraction opens new datasets; manual repair ensures accuracy with low user burden. Ongoing and future work: relation assembly, data relevance ranking, join finding.


    Spreadsheets are everywhere. Smart people use them: the Medical Center has more than a dozen cancer tissue sample databases in Excel!

    About 72% of the spreadsheets we downloaded in a big Web crawl have relational data. Data integrity: Zeeberg et al (2004) found that microarray gene information can be corrupted by Excel. For example, there is a tumor suppressor gene called Deleted In Esophageal Cancer, aka DEC1. When added to Excel, it is converted to 1-DEC. Also, certain data in the RIKEN clone designation format consists of numbers separated by an E; when brought into Excel, this is converted to floating-point notation. As of the paper's writing, such data had made it into publicly available genetic datasets, such as the LocusLink database. Careful users can sidestep these problems, but it's a constant struggle. The authors ended up writing their own software to test Excel data for errors. The problem is that Excel has no idea what data it's converting. A toy checker in that spirit follows.
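
    In the same spirit as the authors' checking software (a hypothetical illustration, not their tool), a few lines can flag identifier cells that look like the output of Excel's silent conversions:

```python
import re

# Illustrative patterns, not exhaustive: cells that look like Excel's silent
# conversions (dates, scientific notation) in a column that is supposed to
# hold gene or clone identifiers.
DATE_LIKE = re.compile(r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$", re.I)
SCI_NOTATION = re.compile(r"^\d(\.\d+)?E\+\d+$", re.I)   # e.g. 2.31E+19

def looks_mangled(cell: str) -> bool:
    """True if an identifier cell looks like Excel has converted it."""
    return bool(DATE_LIKE.match(cell) or SCI_NOTATION.match(cell))

for cell in ["DEC1", "1-DEC", "2310009E13", "2.31E+19"]:
    print(f"{cell:>12}: {'MANGLED' if looks_mangled(cell) else 'ok'}")
```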


    Ad hoc == totally contrary to the traditional integration setting. Also, no expensive extraction overhead.

    Any dataset == spreadsheets from inside the firewall and outside.

    Some people have focused on the ad hoc analysis part, e.g., the mashup tool trend from a few years ago and recent work by Sarawagi and Yakout. But easy extraction is a core part.

    I will not talk about data integrity, but it's important, and that's another feature you can build if you have good schema information. Spreadsheet 199 from the 2010 Statistical Abstract of the United States describes smoking rates among different gender, age, and race groupings from 1965-2007. In the interests of time, I'll skip over how the data frame locator operates.

    Use a max-likelihood, BFGS-based method for choosing weights; LBP for inference. CRFs from Lafferty et al, 2001.

    Shows the distribution of spreadsheet types among a 200-sheet sample from the web corpus. Purple region: broken ones. Pink: other.

    Based on a sample of 200 sheets from WEB. H-Sheet and N-Sheet indicate a spreadsheet that has relational data: 72.5% of the data. H-Sheets (45% of relational data) have some form of hierarchy to them.

    Depth of SAUS: 2.96 on average. Depth of WEB: 2.61 on average.

    LEFT is harder than TOP. We handily beat the standard BASELINE approach. Errors in the data frame finder account for only a small percentage of overall errors; the one big exception here is the WEB TOP region, which in our experience could be difficult even for a human.

    For SAUS, we use government-provided topic markers. For WEB, we use the domain as a proxy. BTS == Bureau of Transportation Statistics. R10 and R50 are randomly chosen.

    Averaged over 100 random repair orderings; error bars indicate standard deviation. The ratio ranges from 0.25 to about 0.6 (for SAUS transport).

    Big observations: SAUS does better than WEB. Random does great! And topic grouping does not seem to have a huge impact.

    Interestingly, spreadsheets on the same topic do not appear to share much metadata. Instead, manual repair seems to work because of typographic ties and topic-independent metadata (such as age, race, gender, states, industries, and so on). The only exception to this appeared to be the USDA spreadsheets, which were unusually uniform.


    How does ordering the repair operations impact the completion rate? Witkowski = the QueryByExcel project; Excel becomes a front-end to a SQL database. Liu = a formal algebra translating spreadsheet operations into the database.