Effective Web Scraping with OXPath

DIADEM: domain-centric intelligent automated data extraction methodology

Effective Web Scraping with OXPath

http://oxpath.org

Giovanni Grasso - Oxford University

May 15th, 2013 @ WWW developer track

joint work with Tim Furche, Christian Schallhart,

Wednesday, 15 May 13


OXPath presentation at WWW 2013 Rio de Janeiro

Transcript of Effective Web Scraping with OXPath

Page 2: Effective Web Scraping with OXPath

1 OXPath » Lingua Franca for Web Extraction

A Call for Action in Web Extraction!

Past: Form Filling + HTML Patterns

Now: Interaction + DOM Patterns

getting to the data requires interaction not just form filling

identifying relevant data from rendered DOMs

across several pages

access to all CSS properties (computed style)


Page 3: Effective Web Scraping with OXPath


The nesting in the result mirrors the structure of the OXPath expression: extraction markers in a predicate (title, source) represent attributes to the last marker outside the predicate (story).

Kleene Star. Finally, we add the Kleene star, as in [12]. For example, the following expression queries Google for “Oxford”, traverses all accessible result pages and extracts all links.

doc("google.com")/descendant::field()[1]/{"Oxford"}/following::field()[1]/{click /}

/( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*

To limit the range of the Kleene star, one can specify upper and lower bounds on the multiplicity, e.g., (...)*{3,8}.
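The bounded Kleene star can be read as "repeat the iterated step between a lower and an upper number of times". The following Python sketch illustrates one plausible reading of that idea on a toy model; it is not OXPath's implementation. The names `bounded_star` and `next_page` and the ten-page site are hypothetical and exist only for this illustration.

```python
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def bounded_star(
    start: T,
    step: Callable[[T], Optional[T]],
    lower: int = 0,
    upper: Optional[int] = None,
) -> Iterator[T]:
    """Yield every state reached after k applications of `step`,
    for lower <= k <= upper (upper=None means unbounded)."""
    state: Optional[T] = start
    k = 0
    while state is not None and (upper is None or k <= upper):
        if k >= lower:
            yield state
        state = step(state)  # corresponds to one "click on Next"
        k += 1


# Hypothetical result pages: page n links to page n + 1, up to page 10.
def next_page(n: int) -> Optional[int]:
    return n + 1 if n < 10 else None


pages_unbounded = list(bounded_star(1, next_page))
pages_bounded = list(bounded_star(1, next_page, lower=3, upper=8))
```

In this toy model the unbounded star visits every page reachable through the "Next" link, while the bounded variant only reports pages reached after three to eight steps.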

doc("zillow.com")/descendant::field()[1]/{"Seattle"}
  /following::*#GOButton/{click/}
  /descendant::input[@type='checkbox'][2]/{uncheck}
  /following::field()[1]/{uncheck/}
  //div[.="Beds"]/following-sibling::select/{"3+"/}
  //a[./text()="More filters"]/{click/}
  //input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}
  /(//span.arrowNext/a/{click/})*
  //ul#search-results/li:<property>
    [//div.property-info//a/{click/}
     //a[.#='More Facts']/{click/}
     //div.home-facts/table:<facts=(.)>]

2. REFERENCES

[1] G. O. Arocena and A. O. Mendelzon. WebOQL: Restructuring documents, databases, and webs. In ICDE, 24–33, 1998.

[2] R. Baumgartner, S. Flesca, and G. Gottlob. Visual web information extraction with Lixto. In VLDB, 119–128, 2001.

[3] M. Benedikt and C. Koch. XPath Leashed. CSUR, 41(1):3:1–3:54, 2007.

[4] M. Bolin, M. Webber, P. Rha, T. Wilson, and R. C. Miller. Automation and customization of rendered web pages. In UIST, 163–172, 2005.

[5] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan. A survey of web information extraction systems. TKDE, 1411–1428, 2006.

[6] V. Crescenzi and G. Mecca. Automatic information extraction from large websites. JACM, 51(5):731–779, 2004.

[7] G. Gottlob, C. Koch, and R. Pichler. Efficient algorithms for processing XPath queries. TODS, 30(2):444–491, 2005.

[8] G. Leshed, E. M. Haber, T. Matthews, and T. Lau. CoScripter: automating & sharing how-to knowledge in the enterprise. In CHI, 1719–1728, 2008.

[9] J. Lin, J. Wong, J. Nichols, A. Cypher, and T. A. Lau. End-user programming of mashups with Vegemite. In IUI, 97–106, 2009.

[10] L. Liu, C. Pu, and W. Han. XWRAP: An XML-enabled wrapper construction system for web information sources. In ICDE, 611–621, 2000.

[11] M. Liu and T. W. Ling. A rule-based query language for HTML. In DASFAA, 6–13, 2001.

[12] M. Marx. Conditional XPath. TODS, 30(4):929–959, 2005.

[13] A. O. Mendelzon, G. A. Mihaila, and T. Milo. Querying the World Wide Web. In DIS, 80–91, 1996.

[14] A. Sahuguet and F. Azavant. Building light-weight wrappers for legacy web data-sources using W4F. In VLDB, 738–741, 1999.

[15] N. Sawa, A. Morishima, S. Sugimoto, and H. Kitagawa. Wraplet: Wrapping your web contents with a lightweight language. In SITIS, 387–394, 2007.

[16] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB, 1033–1044, 2007.

[17] J.-Y. Su, D.-J. Sun, I.-C. Wu, and L.-P. Chen. On design of browser-oriented data extraction system and plug-ins. JMST, 18(2):189–200, 2010.


Page 6: Effective Web Scraping with OXPath


doc("zillow.com")/descendant::field()[1]/{"Seattle"}
  /following::*#GOButton/{click/}
  /descendant::input[@type='checkbox'][2]/{uncheck}
  /following::field()[1]/{uncheck/}
  //div[.="Beds"]/following-sibling::select/{"3+"/}
  //a[./text()="More filters"]/{click/}
  //input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}
  /(//span.arrowNext/a/{click/})*
  //ul#search-results/li:<property>
    [//div.property-info//a/{click/}
     //div.home-description:<info=(.)>]


Page 9: Effective Web Scraping with OXPath

1 OXPath » Lingua Franca for Web Extraction

Wrapper Babel

Wrapper induction & data extraction systems

each invents its own wrapper language

or uses its own ad-hoc tool or proprietary language

Mainly pattern matching + imperative navigation

mix of XPath & external flow control

limited interaction with complex interfaces

(simple) form filling & submit

focus on automation via visual interfaces

limited extraction language

no multiway navigation


Page 10: Effective Web Scraping with OXPath

1 OXPath » Lingua Franca for Web Extraction

Why OXPath?

an XPath for data extraction

simplicity

learnable

familiarity

scalability

Page 11: Effective Web Scraping with OXPath

2 OXPath » The Language

OXPath = XPath + 4

action

iteration

extraction

style


Page 13: Effective Web Scraping with OXPath

Start at kayak.co.uk:

doc("kayak.co.uk")

To select an airport, type a few letters and select from the completion list:

//field().destination/{"Sea" /}

//div#smartbox//li[1]/{click /}

This will submit the form


Page 17: Effective Web Scraping with OXPath

Refine the results by unchecking “2+ stops”:

//*#stops2/{uncheck }

On all result pages

/(//a[.='Next']/{click /})*

and for each flight

//body.resultrow:<flight>


Page 20: Effective Web Scraping with OXPath

Extract the attributes

Mouseover the ! to extract flight quality warnings

//span.qualityWarningIcon/{mouseover /}

Click on the details to extract layovers

Page 21: Effective Web Scraping with OXPath

2 OXPath » The Language

➊ Actions: Browser Interaction

Actions correspond to DOM events, e.g.,

Document: doc("google.com")

Click: {click}

Fill: {"Rio"}

Mouseover: {mouseover}

Executed once on each context node

Return the context nodes for contextual actions, or the root nodes of the new DOM for absolute actions, e.g., {click/}

Page 22: Effective Web Scraping with OXPath

2 OXPath » The Language

➋ Extraction: Compact Tree Construction

Extraction markers select nodes for extraction

record markers: :<flight>

attribute markers: :<price=string(.)>

Extracted data has tree shape

nesting of extraction markers in the OXPath expression defines the nesting of records and the attribute-record associations in the output
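To make the tree-shaped output concrete, here is a minimal Python sketch of how matches of record and attribute markers could be assembled into a nested structure. It is an illustration under stated assumptions, not the OXPath engine's output handler: the tuple layout, the function `build_tree`, and the sample prices are all hypothetical, and each parent match is assumed to appear before its children.

```python
from typing import Dict, List, Optional, Tuple

# (match id, parent match id or None, marker name, extracted value or None)
Match = Tuple[int, Optional[int], str, Optional[str]]


def build_tree(matches: List[Match]) -> List[dict]:
    """Nest each match under the match of its enclosing marker.
    Assumes every parent appears in the list before its children."""
    nodes: Dict[int, dict] = {}
    roots: List[dict] = []
    for match_id, parent_id, name, value in matches:
        node = {"name": name, "value": value, "children": []}
        nodes[match_id] = node
        if parent_id is None:
            roots.append(node)  # top-level record, e.g. :<flight>
        else:
            nodes[parent_id]["children"].append(node)  # e.g. :<price=...>
    return roots


# Hypothetical matches for //...:<flight>[...:<price=string(.)>]
matches: List[Match] = [
    (1, None, "flight", None),
    (2, 1, "price", "420"),
    (3, None, "flight", None),
    (4, 3, "price", "515"),
]
tree = build_tree(matches)
```

Each `flight` record ends up with its own `price` child, mirroring how an attribute marker inside a predicate attaches to the record marker outside it.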


Page 24: Effective Web Scraping with OXPath

2 OXPath » The Language

➌ Iteration: Kleene Star

Most web sites use pagination techniques for results

traversing paginated results requires iteration

⇢ extraction from any unbounded component of a link graph

Kleene Star with action in the iterated expression

/(//a[.='Next']/{click /})*

/(//body/{scroll /})* (infinite scroll)

OXPath’s evaluation algorithm buffers in practice only a constant number of pages
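The constant-buffer idea can be sketched in Python with a generator that keeps only the current page in hand while following "Next" links. This is a simplified illustration, not OXPath's evaluation algorithm; the `Site` mapping, the function `extract_all`, and the three-page example are hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical site: page id -> (results on that page, id of the "Next" page or None)
Site = Dict[int, Tuple[List[str], Optional[int]]]


def extract_all(site: Site, start: int) -> Iterator[str]:
    """Stream results page by page; only one page is held at any time."""
    current: Optional[int] = start
    while current is not None:
        results, next_id = site[current]  # "load" just the current page
        yield from results                # emit its results immediately
        current = next_id                 # follow the "Next" link, drop the old page


site: Site = {
    1: (["a", "b"], 2),
    2: (["c"], 3),
    3: (["d", "e"], None),
}
all_results = list(extract_all(site, 1))
```

Because results are emitted as soon as a page has been processed, the memory needed for pages in this sketch does not grow with the number of pages visited.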

Page 25: Effective Web Scraping with OXPath

2 OXPath » The Language

➍ Style: Querying Visual Attributes

Access to all computed style CSS properties via style axis

Visibility: style::display or style::visibility

Font size: style::font-size

Geometry: style::top, style::left, ...

Color: style::color or style::background-color
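As a rough illustration of why computed style matters for extraction, the Python sketch below filters toy nodes by their display and visibility values, which is the kind of test the style axis makes expressible. The node dictionaries, the helper `is_visible`, and the sample texts are hypothetical; a real engine would read these values from the browser's computed style.

```python
from typing import Dict, List


def is_visible(style: Dict[str, str]) -> bool:
    """Treat a node as visible unless display is 'none' or visibility is 'hidden'."""
    return style.get("display") != "none" and style.get("visibility") != "hidden"


# Hypothetical nodes with already-computed style values.
nodes: List[dict] = [
    {"text": "Price: 420", "style": {"display": "block", "visibility": "visible"}},
    {"text": "tracking pixel", "style": {"display": "none", "visibility": "visible"}},
    {"text": "hidden hint", "style": {"display": "inline", "visibility": "hidden"}},
]

visible_texts = [n["text"] for n in nodes if is_visible(n["style"])]
```

Only the first node survives the filter, which corresponds to restricting extraction to what a user would actually see on the rendered page.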

Page 26: Effective Web Scraping with OXPath

3 Evaluation

Page 27: Effective Web Scraping with OXPath

Constant Memory

[Figure (b), "Millions of results": memory [MB] and #pages [1000] / #results [100,000] over time [h]; series: memory, extracted matches, visited pages]

100,000+ pages, millions of results

Page 28: Effective Web Scraping with OXPath

it’s the browser

[Pie chart: shares of 2%, 13% and 85%; categories: page rendering, browser initialization, OXPath]

Page 29: Effective Web Scraping with OXPath

faster

[Figure (b), "Norm. evaluation time, <150 p.": time w/o page loading [sec] vs. #pages; series: OXPath, Web Content Extractor, Lixto, Visual Web Ripper, Web Harvest, Chickenfoot]

Page 30: Effective Web Scraping with OXPath

even faster

[Figure (c), "Norm. evaluation time, <850 p.": time w/o page loading [sec] vs. number of pages; series: OXPath, Lixto, Web Harvest, Chickenfoot]


Page 32: Effective Web Scraping with OXPath

memory

[Figure: memory [MB] vs. #pages; series: OXPath, Lixto, Web Harvest, Chickenfoot]

only hundreds of pages as other tools fail for more pages

Page 33: Effective Web Scraping with OXPath

3 OXPath » System & Evaluation

Evaluation

constant memory

very low overhead on XPath

minimal page buffer

browser bound

fast

Page 34: Effective Web Scraping with OXPath

4 OXPath User stories

Page 35: Effective Web Scraping with OXPath

4 DIADEM

Unsupervised Domain-specific Web Objects Extraction

presented @ World Wide Web 2012 (WWW’12)


Page 40: Effective Web Scraping with OXPath

DIADEM := domain-centric intelligent automated data extraction methodology

1

Form Understanding & Filling

Flat Text

Context-driven block analysis

3

Energy Performance Chart

Maps

Floor plans

Page 41: Effective Web Scraping with OXPath

DIADEM := domain-centric intelligent automated data extraction methodology

OXPath Wrapper

Cloud extraction

Data integration

4

Single entity (details) pages

Tables

2

Object identification & alignment

Result pages

Flat Text

Context-driven block analysis

3

Energy Performance Chart

Maps

Floor plans

Page 42: Effective Web Scraping with OXPath

4 Wrapper induction in DIADEM

Induced Wrapper (partial)

doc('wwagency.co.uk')//select#sale_type_id/{0/}
  //button.formbtn/{click /}
  (//div.pagenumlinks[last()]//a[last()]/{click /})*
  //div.proplist_wrap:<RECORD>
    [.//span.prop_price:<PRICE=string(.)>]
    [.//ul.prop_keypoints/li[2]/strong:<BEDROOM_ROOMS=string(.)>]
    [.//div.prop_statuses//text():<PROPERTY_STATUS=string(.)>]
    [.//strong.orange:<POSTCODE=string(.)>]
  //div.prop_img/a/{click /}//body
    [.//div#propertypage_copy/p[last()-1]:<DESCRIPTION=string(.)>]
    [.//div#print_contact/address/text()[2]:<ADDRESS=string(.)>]
    [.//a.~'Map view')]/@href:<MAP=string(.)>]
    [.//div#propertypage_copy/p[2]:<RECEPTION_ROOMS=string(.)>]

Page 43: Effective Web Scraping with OXPath

4 DEQA

Question Answering on the Deep Web

presented @ International Semantic Web Conference 2012 (ISWC’12)

Page 44: Effective Web Scraping with OXPath

Question: House near a Kindergarden under 2,000,000 £?

[Diagram: components OXPath, Linking-Metric, TBSL and a Domain Specific Triple Store; panels "Question" and "Answer"; graph nodes White_Road, Kindergarden_A, Kindergarden_B, gr:Offering, 1,499,950 £, 15; edge labels rdf:type, dd:hasPrice, dd:bedrooms, dbp:near]

Page 45: Effective Web Scraping with OXPath

4 OXPath » DEQA: Question Answering on the Deep Web

RDF Wrapper (partial)

doc('wwagency.co.uk')....
  .... //div.proplist_wrap:<gr:Offering>
    [.//span.prop_price:<dd:hasPrice(xsd:double)=string(.)>]
    .....
    [.//strong.orange:<vcard:postal-code=string(.)>]
    ....
    [.//div.prop_img/a/@href:<foaf:page=string(.)>]
  //div.prop_img/a/{click /}//body
    [.//div#propertypage_copy/p[last()-1]:<gr:description=string(.)>]
    [.//a.~'Map view')]/@href:<wgs84:lat=extractLat(.)>]
    [.//a.~'Map view')]/@href:<wgs84:long=extractLong(.)>]

Page 46: Effective Web Scraping with OXPath

4 OXPath » DEQA: Question Answering on the Deep Web

Question translation to SPARQL

Edwardian houses close to supermarket for less than 1,000,000 in Oxfordshire

mapping them to specific restrictions, e.g. cheap could be mapped to costs for flats less than 800 pounds per month.

An example of a successful query is “all houses in Abingdon with more than 2 bedrooms”:

SELECT ?y WHERE {
  ?y a <http://diadem.cs.ox.ac.uk/ontologies/real-estate#House> .
  ?y <http://diadem.cs.ox.ac.uk/ontologies/real-estate#bedrooms> ?y0 .
  ?y <http://www.w3.org/2006/vcard/ns#street-address> ?y1 .
  FILTER(?y0 > 2) .
  FILTER(regex(?y1,'Abingdon','i')) .
}

In that case, TBSL first performs a restriction by class (“House”), then it finds the town name “Abingdon” from the street address and it performs a filter on the number of rooms. Note that most QA systems would not be sufficiently powerful to include such filters.
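To spell out what the three restrictions in the Abingdon query do, the following Python sketch applies the same conditions (class "House", more than 2 bedrooms, case-insensitive match on the street address) to a handful of in-memory records. It only mirrors the filter logic and does not use a triple store or TBSL; the record fields, the function `matches_query`, and the sample addresses are invented for this illustration.

```python
import re
from typing import List

# Hypothetical records standing in for triples about real-estate offers.
properties: List[dict] = [
    {"id": "p1", "type": "House", "bedrooms": 3, "street_address": "12 Ock Street, Abingdon"},
    {"id": "p2", "type": "House", "bedrooms": 2, "street_address": "4 Bath Street, ABINGDON"},
    {"id": "p3", "type": "Flat", "bedrooms": 4, "street_address": "7 High Street, Abingdon"},
    {"id": "p4", "type": "House", "bedrooms": 5, "street_address": "1 Banbury Road, Oxford"},
]


def matches_query(p: dict) -> bool:
    """Class restriction, numeric filter, and case-insensitive regex filter."""
    return (
        p["type"] == "House"                                                # ?y a ...#House
        and p["bedrooms"] > 2                                               # FILTER(?y0 > 2)
        and re.search("Abingdon", p["street_address"], re.IGNORECASE) is not None  # regex(..., 'i')
    )


answers = [p["id"] for p in properties if matches_query(p)]
```

Only `p1` satisfies all three conditions: `p2` has too few bedrooms, `p3` is not a house, and `p4` is not in Abingdon.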

Another example is “Edwardian houses close to supermarket for less than 1,000,000 in Oxfordshire”, which was translated to the following query:

SELECT ?x0 WHERE {
  ?x0 <http://dbpedia.org/property/near> ?y2 .
  ?x0 a <http://diadem.cs.ox.ac.uk/ontologies/real-estate#House> .
  ?v <http://purl.org/goodrelations/v1#includes> ?x0 .
  ?x0 <http://www.w3.org/2006/vcard/ns#street-address> ?y0 .
  ?v <http://diadem.cs.ox.ac.uk/ontologies/real-estate#hasPrice> ?y1 .
  ?y2 a <http://linkedgeodata.org/ontology/Supermarket> .
  ?x0 <http://purl.org/goodrelations/v1#description> ?y .
  FILTER(regex(?y0,'Oxfordshire','i')) .
  FILTER(regex(?y,'Edwardian ','i')) .
  FILTER(?y1 < 1000000) .
}

In that case, the links to LinkedGeoData were used by selecting the “near” property as well as by finding the correct class from the LinkedGeoData ontology.

3.2 Performance Evaluation

We conclude this evaluation with a brief look at the system performance, focusing on the resource intensive background extraction and linking, which require several hours compared to seconds for the actual query evaluation. For the real-estate scenario, the TBSL algorithm requires 7 seconds on average for answering a natural language query using a remote triple store as backend. The performance is quite stable even for complex queries, which required at most 10 seconds. So far, the TBSL system has not been heavily optimised in terms of performance, since the research focus was clearly to have a very flexible, robust and accurate algorithm. Performance could be improved, e.g., by using fulltext indices for speeding up NLP tasks and queries.

Page 47: Effective Web Scraping with OXPath

5 Hands-on

Page 48: Effective Web Scraping with OXPath

5 OXPath Engine

Version 1.1 available on http://oxpath.org (via code.google)

Java

Maven archetype and Command Line Interface with examples

Output in XML, RDF and Relational DB, custom output handler

Based on HTMLUnit

some limitations (e.g., no style axis)

Ongoing work

WebDriver-based implementation, JavaScript in the near future

Visual Interface (record-and-play) as Firefox Extension

Any feedback is welcome! Get in touch with me

Page 49: Effective Web Scraping with OXPath

Live Demo


Page 50: Effective Web Scraping with OXPath

Questions?

oxpath.org