Data Extraction and Integration from Imprecise Web Sources

Post on 08-Jan-2016

Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti

Università degli Studi Roma Tre

(Creative Commons License, see last slide)

Data-intensive websites

[Figure labels: Website, Database, Template1, Template2, Template3, target]

Flint goal

[Figure labels: Flint; …StockQuote with attributes Last, Min, Max, Volume, 52high, Open]

System architecture

WebSearch [WIDM08]

Data Extraction

Data Integration

The Web

Novel contribution

• Unsupervised
• Automatic
• Scalable
• No knowledge available

Data Extraction

RoadRunner [Vldb01] ExAlg [Sigmod03]

TurboWrapper [Vldb07]

• Unsupervised
• Automatic
• Scalable
• Uncertain Data
• No labels available
• No corpus available

Data Integration

WebTables [Vldb08] Cimple [Vldb07]

MetaQuerier [Cidr05] PayGo [Cidr07]

Data Extraction

AAPL, GOOG, MSFT, INTC, …

128.09, 439.54, 34.89, 112.37, …

127.81, 439.25, 32.13, 111.01, …

132.43, 443.82, 33.67, 114.32, …

0.50%, -0.38%, 1.23%, 3.92%, -1.65%, …

Add AAPL to Your Portfolio, Add GOOG to Your Portfolio, Add MSFT to Your Portfolio, Add INTC to Your Portfolio, …

Data Extraction

HTML fragments taken from two pages belonging to the same website:

XPath                          Page 1       Page 2        Note
/html/body/table/tr[1]/td[2]   1,132,228    1,735,857
/html/body/table/tr[2]/td[2]   $20.66       $414.58
/html/body/table/tr[3]/td[2]   $11.70       $247.30
/html/body/table/tr[4]/td[2]   $20.72       $414.06
/html/body/table/tr[5]/td[2]   $0.02        99,494,200    Extraction error!
/html/body/table/tr[6]/td[2]   4,732,600    null          ?
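The positional rules above can be replayed on toy pages to see how such an error might arise. The following is a minimal sketch, not the Flint extractor: it assumes (the slide does not say so) that the second page simply has one table row fewer, uses a placeholder label cell, and relies only on Python's standard `xml.etree.ElementTree`. The helper names `build_page`, `extract`, and `kind` are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_page(values):
    """Build a small well-formed page: one table row per value, value in the 2nd cell."""
    rows = "".join(f"<tr><td>label</td><td>{v}</td></tr>" for v in values)
    return ET.fromstring(f"<html><body><table>{rows}</table></body></html>")

# Second-cell values copied from the slide; giving page 2 one row fewer
# is an assumption made for this sketch.
page1 = build_page(["1,132,228", "$20.66", "$11.70", "$20.72", "$0.02", "4,732,600"])
page2 = build_page(["1,735,857", "$414.58", "$247.30", "$414.06", "99,494,200"])

def extract(page, row):
    """Apply the positional rule /html/body/table/tr[row]/td[2] to one page."""
    cell = page.find(f"body/table/tr[{row}]/td[2]")  # path is relative to <html>
    return cell.text if cell is not None else None

def kind(value):
    """Very coarse value type, used only to spot misaligned columns."""
    if value is None:
        return "null"
    return "money" if value.startswith("$") else "number"

# One (page 1, page 2) pair per positional rule.
columns = [(extract(page1, r), extract(page2, r)) for r in range(1, 7)]

# Rules whose two values have different coarse types look unreliable.
suspicious = [r for r, (a, b) in enumerate(columns, start=1) if kind(a) != kind(b)]
print(suspicious)
```

Under these assumptions, the rules for tr[5] and tr[6] are the ones flagged, because each pairs values of different kinds across the two pages.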

Data Integration

[Figure: one source table with columns stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16)]

Data Integration

[Figure: one source table with columns stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16); labels t=0.5, t=0.5, t=0.5]

Data Integration

[Figure: two source tables, each with columns stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16); labels t=0.5, t=0.5, t=0.5; scores 1.0, 1.0, 1.0]

Data Integration

[Figure: two source tables, each with columns stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16); labels t=0.5, t=0.5, t=0.5]

Data Integration

[Figure: two source tables with columns stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16), and a third source table with columns stock (AA, GO, MS), min (4, 25, 10), price (6, 26, 12); labels t=0.5, t=0.5, t=0.5; scores 0.6, 1.0, 1.0]

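Data Integration

[Figure: two source tables with columns stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16), and a third source table with columns stock (AA, GO, MS), min (4, 25, 10), price (6, 26, 12); labels t=0.5, t=0.5, t=0.5; a "?" followed by scores 1.0, 1.0]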

Data Integration

[Figure: two source tables with columns stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16), and a third source table with columns stock (AA, GO, MS), min (4, 25, 10), price (6, 26, 12); labels t=0.5, t=0.5; score 1.0; labels t=0.7, t=0.7]

Data Integration

[Figure: two source tables with columns stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16), and a third source table with columns stock (AA, GO, MS), min (4, 25, 10), price (6, 26, 12); labels t=0.5, t=0.5, t=0.7, t=0.7]
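The role of the thresholds in these slides can be illustrated with a small sketch. It is not the Flint matching algorithm: the scores 1.0, 1.0, and 0.6 and the thresholds 0.5 and 0.7 are taken from the slides, but pairing each score with a particular attribute, and choosing which mappings get the higher threshold, are assumptions made only for illustration. The function name `accepted_matches` is hypothetical.

```python
# Similarity scores as they appear on the slides; pairing each score with a
# (source attribute, candidate mapping) is an assumption made for this sketch.
scores = {
    ("stock", "stock"): 1.0,
    ("min", "min"): 1.0,
    ("price", "max"): 0.6,
}

def accepted_matches(scores, thresholds):
    """Return the attribute pairs whose score reaches the mapping's threshold t."""
    return sorted(
        pair for pair, score in scores.items()
        if score >= thresholds[pair[1]]
    )

# All mappings start with t=0.5, as on the earlier slides.
loose = {"stock": 0.5, "min": 0.5, "max": 0.5}

# Assumed for illustration: the thresholds of two mappings are raised to t=0.7.
strict = {"stock": 0.5, "min": 0.7, "max": 0.7}

print(accepted_matches(scores, loose))
print(accepted_matches(scores, strict))
```

With t=0.5 the price column would be accepted into the max mapping on a score of 0.6. With t=0.7 it is rejected, while the identical stock and min columns still match.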

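Wrapper Refinement

[Figure: two source tables with columns stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16); a third source table with columns stock (AA, GO, MS), min (4, 25, 10), price (6, 26, 12); a fourth source with a single column min/max (10, null, 10); labels t=0.5, t=0.5; "? ?"; scores 0.3 (weak), 0.3 (weak), 0.0, 0.0]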

Wrapper Refinement

matching value

nearby template tokens

//td[contains(text(),'Open')]/../td[2]
//td[contains(text(),'Open')]/../../tr[5]/td[1]
//td[contains(text(),'Open')]/../../tr[5]/td[2]
//td[contains(text(),'High')]/../td[2]
…

t=0.7 t=0.7
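A rule anchored on a nearby template token keeps working when rows shift, which a purely positional rule does not. The sketch below shows the idea on two toy pages. It is an approximation under stated assumptions: Python's standard `xml.etree.ElementTree` does not support `contains()`, so the label test is an exact text match (`tr[td='Max']`) rather than the `contains(text(), …)` form on the slide. The page contents, the extra "Open" row, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_page(rows):
    """Build a small well-formed page with one label/value row per entry."""
    cells = "".join(f"<tr><td>{label}</td><td>{value}</td></tr>" for label, value in rows)
    return ET.fromstring(f"<html><body><table>{cells}</table></body></html>")

# Hypothetical pages: the second one has an extra first row, so positions shift.
pages = {
    "AA": build_page([("Min", "4"), ("Max", "10")]),
    "GO": build_page([("Open", "26"), ("Min", "25"), ("Max", "33")]),
}

def by_position(page, row):
    """Positional rule: second cell of the row-th table row."""
    cell = page.find(f"body/table/tr[{row}]/td[2]")
    return cell.text if cell is not None else None

def by_label(page, label):
    """Label-anchored rule: second cell of the row that has a cell equal to `label`."""
    cell = page.find(f".//tr[td='{label}']/td[2]")
    return cell.text if cell is not None else None

# The positional rule returns Max on one page and Min on the other.
print({name: by_position(page, 2) for name, page in pages.items()})

# The label-anchored rule returns Max on both pages.
print({name: by_label(page, "Max") for name, page in pages.items()})
```

With a full XPath 1.0 engine, the same rule could be written in the slide's own form, `//td[contains(text(),'Max')]/../td[2]`.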

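Wrapper Refinement

[Figure: two source tables with columns stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16); a third source table with columns stock (AA, GO, MS), min (4, 25, 10), price (6, 26, 12); a fourth source with a single column min/max (10, null, 10); labels t=0.5, t=0.5, t=0.7, t=0.7; scores 1.0, 1.0; new columns max (10, 33, 16) and min (4, 25, 10) shown with the rules below]

//td[contains(text(),'Max')]/../td[2]
//td[contains(text(),'Min')]/../td[2]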

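Wrapper Refinement

[Figure: two source tables with columns stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16); a third source table with columns stock (AA, GO, MS), min (4, 25, 10), price (6, 26, 12); a fourth source with a single column min/max (10, null, 10), together with columns max (10, 33, 16) and min (4, 25, 10); labels t=0.5, t=0.5]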

Experimental Results (100 websites for each domain)

Soccer domain (45,714 pages)

Attribute      |m|
Name           90
Birth Date     61
Height         54
Nationality    48
Club           43
Position       43
Weight         34
League         14

Videogame domain (49,262 pages)

Attribute      |m|
Title          86
Publisher      59
Developer      45
Genre          28
ESRB rating    40
Release Date   9
Platform       9
# Players      6

Finance domain (57,623 pages)

Attribute      |m|
Stock Symbol   84
Price Change   73
% Change       73
Volume         52
Day Low        43
Day High       41
Last Price     29
Open Price     24

the end!

http://flint.dia.uniroma3.it

License

• This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.